BACKGROUND OF THE INVENTION

The present invention relates generally to classifying and
identifying polypeptides having similar structure or function
based on comparative amino acid sequence analysis and more
specifically to determining structure-related properties of a
ligand when bound to a polypeptide of known amino acid sequence.

Structure determination plays a central role in chemistry and
biology due to the correlation between the structure of a
molecule and its function. In particular, a three dimensional
model of a therapeutic target polypeptide can be of valuable
assistance in the design or discovery of therapeutic drugs. The
structure of a ligand bound to a polypeptide as observed in a
three dimensional model can be used as a template for
identifying structural properties to be incorporated into
candidate drugs. Alternatively, using computer assisted methods,
a candidate drug can be identified based on structural
properties that allow docking to a binding site in the three
dimensional model of the target polypeptide, much as a key fits
a lock. By structure-based methods such as these, lead compounds
can be identified for further development.



Although methods for structure determination are evolving, it is
currently difficult, costly and time consuming to empirically
determine the three dimensional structure of a polypeptide. In
general, determining such structures for polypeptides complexed
with ligands is even more difficult. One approach to
circumventing this difficulty is theoretical modeling of
polypeptide structures with or without a bound ligand based on
more readily available structural and functional information.
Such theoretical modeling approaches are based on the tenet that
the three-dimensional structure and function of a polypeptide
are imparted by its amino acid sequence and the corollary that
polypeptides with similar amino acid sequences have similar
structure and function.

Theoretical determination of a three dimensional model for a
polypeptide by ab initio methods is a relatively undeveloped
method. However, another theoretical approach, referred to as
homology modeling, has been used to infer structure for a
particular polypeptide by threading its amino acid sequence
through or overlaying the sequence upon a three-dimensional
model of a homologous polypeptide. The successful application of
homology modeling to determining polypeptide structure relies
upon choosing a correct polypeptide template for comparison. In
most cases criteria for comparison are unavailable or
unreliable.

Thus, there exists a need for efficient methods to identify
homologous amino acid sequences and to identify structural or
functional characteristics of a polypeptide based on its amino
acid sequence. A need also exists for methods to determine
ligand binding properties of polypeptides based on sequence
information. The present invention satisfies these needs and
provides related advantages as well.

SUMMARY OF THE INVENTION 

The invention provides a method for separating 
two or more subsets of polypeptides within a set of 
polypeptides. The method includes the steps of: (a) 
determining a sequence comparison signature for each 
amino acid sequence in a set of amino acid sequences, 
wherein the sequence comparison signature includes 
pairwise comparison scores for the amino acid sequence 
compared to each of the other amino acid sequences in the 
set; (b) constructing a distance arrangement including 
the sequence comparison signatures related according to 
the distance between each of the sequence comparison 
signatures; and (c) identifying a first and second 
cluster of sequence comparison signatures in the distance 
arrangement, wherein the first cluster includes sequence 
comparison signatures for polypeptides having a similar 
protein fold or biological function, the protein fold or 
function being different compared to a protein fold or 
function of polypeptides having sequence comparison 
signatures in the second cluster. 

The invention also provides a method for identifying a member of
a polypeptide family. The method includes the steps of: (a)
determining a query sequence comparison signature for an amino
acid sequence, wherein the query sequence comparison signature
includes pairwise comparison scores for the amino acid sequence
compared to each amino acid sequence in a set; (b) comparing the
distance between the query sequence comparison signature and the
sequence comparison signatures for other amino acid sequences in
the set, wherein the sequence comparison signatures for other
amino acid sequences in the set are clustered into polypeptide
families; and (c) identifying a proximal cluster having one or
more sequence comparison signatures that have a closer distance
to the query sequence comparison signature than the sequence
comparison signatures of a distal cluster, thereby identifying
the polypeptide having the query sequence comparison signature
as being a member of the polypeptide family for the proximal
cluster.
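
Step (c) of the method above can be illustrated with a minimal Python sketch. The function names, the family labels and the choice of Euclidean distance with a nearest-member rule are assumptions made for illustration only; the text does not prescribe a particular distance or linkage rule.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def proximal_cluster(query: Sequence[float],
                     clusters: Dict[str, List[Sequence[float]]]) -> str:
    """Return the label of the cluster holding the signature closest to the query.

    `clusters` maps a hypothetical family label to the signatures already
    assigned to that family (assumed non-empty).
    """
    best_label, best_dist = None, math.inf
    for label, members in clusters.items():
        # Distance to a cluster = distance to its nearest member (an assumption).
        dist = min(euclidean(query, m) for m in members)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


clusters = {
    "family_A": [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2]],
    "family_B": [[0.1, 0.2, 1.0]],
}
assigned = proximal_cluster([0.95, 0.9, 0.15], clusters)
```

Here the query signature lies closest to a member of the hypothetical "family_A", so that cluster is the proximal cluster and "family_B" is the distal one.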

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 shows a matrix of sequence comparison scores for 15
sequences.

Figure 2 shows clustered sequence comparison 
scores for the 15 sequences presented in Figure 1. 

Figure 3 shows a sequence comparison score for sequence 16
compared to the clustered sequence comparison scores presented
in Figure 2.

Figure 4 shows a multiple alignment of E. coli DXPR to
S. aureas homoserine dehydrogenase (1EBF_A).

Figure 5 shows a homology model for E. coli DXPR superimposed on
the model of NAD+ from the X-ray crystal structure of S. aureas
homoserine dehydrogenase.



DETAILED DESCRIPTION OF THE INVENTION 



The invention provides methods for classifying polypeptides into
groups of similar structure or function based on amino acid
sequence similarities. The methods can be used to classify
polypeptides from a family of polypeptides that bind the same
ligand into pharmacofamilies that bind particular conformations
of the ligand. An advantage of the invention is that ligand
binding properties can be identified for polypeptides in a
database for which sequence information is readily available but
structural and/or functional properties are incompletely known
or unavailable. An advantage of classifying polypeptides
according to bound conformations of a ligand is that a
pharmacofamily is likely to contain polypeptides having greater
binding specificity for a particular molecule than other
polypeptides in the same family. Thus, the methods allow
identification of a pharmacofamily that can specifically
interact with a particular therapeutic agent or drug.

Additionally, the methods of the invention can be used to
determine a conformer model or pharmacophore model based on a
bound conformation or conformation-dependent property of a
ligand bound to polypeptides in a pharmacofamily. The invention
is therefore advantageous in providing a model for the design
and identification of therapeutic compounds having specificity
for a pharmacofamily of polypeptides.

Another advantage of the invention is that the methods provide a
correlation between polypeptide sequence, a parameter that is
relatively easy to measure, and polypeptide function,
polypeptide three-dimensional structure or bound-ligand
three-dimensional structure, parameters of tremendous value but
often difficult to measure. Therefore, the methods of the
invention can be used to determine structural characteristics of
a polypeptide or its bound ligand based on the amino acid
sequence of the polypeptide. Furthermore, the methods can be
used to determine polypeptide function independent of
three-dimensional structure information.

As used herein, the term "pharmacofamily," when used in
reference to polypeptides, is intended to refer to a set of
polypeptides that bind a ligand such that the ligand is bound in
substantially the same conformation.

As defined herein, a "member" of a polypeptide pharmacofamily
refers to an individual polypeptide that is classified in a
polypeptide pharmacofamily because the polypeptide binds a
conformation of a ligand that is substantially the same as a
conformation of the ligand bound to another polypeptide in the
pharmacofamily.

As used herein, the term "ligand-binding family" is intended to
refer to a set of polypeptides that can bind to the same ligand,
or portion thereof. The term includes a set of polypeptides
having binding activity for a common ligand with sufficient
affinity, avidity or specificity to allow measurement of the
binding event. As defined herein, a "member" of a ligand-binding
family refers to an individual polypeptide that binds the same
ligand, or portion thereof, as that which binds another
polypeptide in the ligand-binding family. The bound
conformations of a ligand bound by individual members of a
ligand-binding family can be substantially the same or different
from each other.

As used herein, the term "bound conformation,"
when used in reference to a ligand, refers to the 
location of atoms of a ligand relative to each other in 
three dimensional space, where the ligand is bound to a 
polypeptide. The location of atoms in a ligand can be 
described, for example, according to bond angles, bond 
distances, relative locations of electron density, 
probable occupancy of atoms at points in space relative 
to each other, probable occupancy of electrons at points 
in space relative to each other or combinations thereof. 

As used herein, the term "substantially the same," when used in
reference to bound conformations of a ligand, or portion
thereof, is intended to refer to two or more bound conformations
that can be overlaid upon each other in 3-dimensional space such
that all corresponding atoms between the two conformations are
overlapped. Accordingly, "different" bound conformations cannot
be overlaid upon each other in 3-dimensional space such that all
corresponding atoms between the two bound conformations, or
portion thereof, are overlapped. Structural overlap can be
determined as described below.

As used herein, the term "sequence comparison signature" refers
to a representation of the degree of similarity, likeness or
identity for a particular amino acid sequence compared to a
plurality of amino acid sequences. A representation included in
the term can be a set of numerical representations such as
pairwise comparison scores or a computer readable representation
thereof. The numerical representations can be represented as a
string of values and can be, for example, in a computer readable
format. An amino acid sequence included in the term can be
represented by nucleotide sequences or other sequence strings
that can be translated into the amino acid sequence. A query
sequence comparison signature is a sequence comparison signature
that is compared to one or more other sequence comparison
signatures in a set or database. A plurality of sequences
included in the term can be 2 or more sequences. Larger
pluralities can also be included such as those with 10 or more,
100 or more, 1000 or more, 2000 or more, 5000 or more or 10000
or more sequences.
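
The definition above can be pictured as one row of a pairwise score matrix. The following Python sketch is illustrative only: the scoring function is a toy position-by-position identity fraction rather than one of the alignment scores discussed in this document, the sequence names are hypothetical, and the self-comparison is kept in each row (an assumption) so that every signature has the same length and ordering.

```python
from typing import Callable, Dict, List


def identity_fraction(a: str, b: str) -> float:
    """Toy pairwise score: fraction of identical residues at matching
    positions over the shorter sequence (no alignment is performed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def comparison_signatures(
    seqs: Dict[str, str],
    score: Callable[[str, str], float] = identity_fraction,
) -> Dict[str, List[float]]:
    """Map each sequence name to its vector of pairwise comparison scores
    against every sequence in the set, in a fixed order."""
    names = list(seqs)
    return {q: [score(seqs[q], seqs[t]) for t in names] for q in names}


# Hypothetical three-sequence set.
signatures = comparison_signatures({"seq1": "MKV", "seq2": "MKL", "seq3": "AAA"})
```

Each value in `signatures` is a string of numbers in computer readable form, and any other pairwise comparison score could be substituted for `identity_fraction`.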

As used herein, the term "pairwise comparison score" refers to a
representation of the degree of similarity, likeness or identity
for a particular amino acid sequence compared to another amino
acid sequence. The representation can be a numerical value
indicating a statistically relevant similarity between the
sequence and the sequence model. Statistically relevant
similarity is the probability that a score of a given value
would be observed from the comparison of a random sequence to
the length and composition of a query sequence and the sequences
in a database. A statistically relevant similarity can be
indicated by an expectation value (E-value), local sequence
identity or bit score as described, for example, in Durbin et
al., Biological Sequence Analysis, Cambridge University Press
(1998).
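
As an illustration of how a bit score and an E-value relate to a raw alignment score, the sketch below uses the standard Karlin-Altschul relations. These formulas are not stated in this document; the parameters `lam` and `k` (the scoring-system constants lambda and K) and the function names are assumptions introduced for the example.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bits`:
    E = m * n * 2^(-S'), with m and n the query and database lengths."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative values only; lam and k depend on the scoring matrix in use.
bits = bit_score(raw_score=10.0, lam=math.log(2), k=1.0)
expect = e_value(bits, query_len=32, db_len=32)
```

A lower E-value indicates a match that is less likely to arise by chance for a query and database of the given sizes.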



As used herein, the term "clustering" refers to partitioning a
data set into two or more subsets where the members within each
subset are similar and members in different subsets are
correspondingly dissimilar. A data set included in the term can
contain amino acid sequences, or representations of
relationships between amino acid sequences such as sequence
comparison signatures. The term can include partitioning
polypeptides from a ligand-binding family into two or more
pharmacofamilies. Partitioning can also be based on similarity
or dissimilarity in other structural or functional properties
such as protein fold, SCOP-family, enzymatic activity, presence
or absence of a particular structural motif or other properties
set forth below. The term can include partitioning based on
sequence comparison scores or pairwise comparison scores. The
term can include partitioning by, for example, hierarchical
clustering such as agglomerative clustering or divisive
clustering as described in Manley, Multivariate Statistical
Methods: a Primer, Chapman and Hall, London (1995) and
Aldenderfer and Blasfield, Cluster Analysis, Sage Publications,
Beverley Hills (1984); non-hierarchical clustering such as
Jarvis-Patrick clustering (see, for example, Jarvis and Patrick,
IEEE Trans. Comput. C-22:1025-1034 (1973)); or cell-based
clustering (see, for example, Schnur, J. Chem. Inf. Comput. Sci.
39:36-43 (1999)).
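
As one illustration of the agglomerative variant of hierarchical clustering named above, the following Python sketch repeatedly merges the two closest clusters of sequence comparison signatures until a requested number of clusters remains. Single linkage, Euclidean distance and the stopping rule are assumptions chosen for brevity; they are not taken from the cited references.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors: Sequence[Sequence[float]],
                n_clusters: int) -> List[List[int]]:
    """Single-linkage agglomerative clustering.

    Returns clusters as lists of indices into `vectors`.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


signatures = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
groups = agglomerate(signatures, n_clusters=2)
```

With these toy signatures the first two sequences merge into one cluster and the third remains on its own.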






As used herein, the term "cluster" refers to a subset of amino
acid sequences or representations thereof in a set that are
similar to each other and different from amino acid sequences or
representations thereof in another subset of the set.

As used herein, the term "distance" refers to a representation
of the degree of difference or deviation that separates two
things in relationship. The term can include the degree of
difference or deviation that separates amino acid sequences
according to an evolutionary model, structural properties such
as Chou-Fasman propensities, chemical properties such as charge,
polarity or shape or combinations thereof. The distance can be a
measure of the separation of vector representations of sequences
in high dimensional space. The distance can be, for example, a
Euclidean distance, exclusive OR distance, Tanimoto coefficient
or Mahalanobis distance.
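
Three of the measures named above can be sketched in a few lines of Python. The function names are illustrative. The exclusive OR distance and the Tanimoto coefficient are shown for binary vectors, which is an assumption about how the signatures would be encoded; the Tanimoto coefficient is a similarity, so one minus the coefficient is used when a distance is needed. The Mahalanobis distance is omitted because it requires a covariance estimate for the data set.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two real-valued vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def xor_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(int(bool(a) != bool(b)) for a, b in zip(u, v))


def tanimoto(u: Sequence[int], v: Sequence[int]) -> float:
    """Shared set bits divided by bits set in either vector."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 1.0  # two empty vectors: identical


d_euclid = euclidean([0.0, 0.0], [3.0, 4.0])
d_xor = xor_distance([1, 0, 1, 1], [1, 1, 0, 1])
t_coeff = tanimoto([1, 0, 1, 1], [1, 1, 0, 1])
```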

As used herein, the term "distance arrangement" refers to a
grouping of sequence comparison scores ordered relative to the
degree of difference or deviation from each other. The term can
include a graphical representation such as a matrix or tree
structure.

As used herein, the term "polypeptide" is intended to refer to a
polymer of two or more amino acids. The term is intended to
include polymers containing amino acid stereoisomers, analogues
and functional mimetics thereof. For example, derivatives can
include chemical modifications of amino acids such as
alkylation, acylation, carbamylation, iodination, or any
modification which derivatizes the polypeptide. Analogues can
include modified amino acids, for example, hydroxyproline or
carboxyglutamate, and can include amino acids, or analogs
thereof, that are not linked by peptide bonds. Mimetics
encompass chemicals containing chemical moieties that mimic the
function of the polypeptide regardless of the predicted
three-dimensional structure of the compound. For example, if a
polypeptide contains two charged chemical moieties in a
functional domain, a mimetic places two charged chemical
moieties in a spatial orientation and constrained structure so
that the corresponding charge is maintained in three-dimensional
space. Thus, all of these modifications are included within the
term "polypeptide" so long as the polypeptide retains its
binding function.

As used herein, the term "ligand" refers to a molecule that can
specifically bind to a polypeptide. Specific binding, as it is
used herein, refers to binding that is detectable over
non-specific interactions by quantifiable assays well known in
the art such as those that measure association rates,
dissociation rates or equilibrium association or dissociation
constants. A ligand can be essentially any type of natural or
synthetic molecule including, for example, a polypeptide,
nucleic acid, carbohydrate, lipid, amino acid, nucleotide or any
organic derived compound. The term also encompasses a cofactor
or a substrate of a polypeptide having enzymatic activity, or
substrate that is inert to catalytic conversion by the bound
polypeptide. Specific binding to a polypeptide can be due to
covalent or non-covalent interactions.

As used herein, the term "conformer model" refers to a
representation of points in a defined coordinate system wherein
a point corresponds to a position of an atom in a bound
conformation of a ligand. The coordinate system is preferably in
3 dimensions; however, manipulation or computation of a model
can be performed in 2 dimensions or even 4 or more dimensions in
cases where such methods are preferred. A point in the
representation of points can, for example, correlate with the
center of an atom. Additionally, a point in the representation
of points can be incorporated into a line, plane or sphere to
include a shape of one or more atoms or volume occupied by one
or more atoms. A conformer model can be derived from 2 or more
bound conformations of a ligand. For example, a conformer model
can be generated from 3 or more, 4 or more, 5 or more, 6 or
more, 7 or more, 8 or more, 10 or more, 15 or more, 20 or more
or 25 or more bound conformations of a ligand.

As used herein, the term "pharmacophore model" refers to a
representation of points in a defined coordinate system wherein
a point corresponds to a position or other characteristic of an
atom or chemical moiety in a bound conformation of a ligand
and/or an interacting polypeptide or ordered water. An ordered
water is an observable water in a model derived from structural
determination of a polypeptide. A pharmacophore model can
include, for example, atoms of a bound conformation of a ligand,
or portion thereof. A pharmacophore model can include both the
bound conformations of a ligand, or portion thereof, and one or
more atoms that both interact with the ligand and are from a
bound polypeptide. Thus, in addition to geometric
characteristics of a bound conformation of a ligand, a
pharmacophore model can indicate other characteristics
including, for example, charge or hydrophobicity of an atom or
chemical moiety. A pharmacophore model can incorporate internal
interactions within the bound conformation of a ligand or
interactions between a bound conformation of a ligand and a
polypeptide or other receptor including, for example, van der
Waals interactions, hydrogen bonds, ionic bonds, and hydrophobic
interactions. A pharmacophore model can be derived from 2 or
more bound conformations of a ligand. For example, a conformer
model can be generated from 3 or more, 4 or more, 5 or more, 6
or more, 7 or more, 8 or more, 10 or more, 15 or more, 20 or
more or 25 or more bound conformations of a ligand.

A point in a pharmacophore model can, for example, correlate
with the center of an atom or moiety. Additionally, a point in
the representation of points can be incorporated into a line,
plane or sphere to indicate a characteristic other than a center
of an atom or moiety including, for example, shape of an atom or
moiety or volume occupied by an atom or moiety. The coordinate
system of a pharmacophore model is preferably in 3 dimensions;
however, manipulation or computation of a model can be performed
in 2 dimensions or even 4 or more dimensions in cases where such
methods are preferred. Multidimensional coordinate systems in
which a pharmacophore model can be represented include, for
example, Cartesian coordinate systems, fractional coordinate
systems, or reciprocal space. The term pharmacophore model is
intended to encompass a conformer model.

As used herein, the term "conformation-dependent property," when
used in reference to a ligand, refers to a characteristic of a
ligand that specifically correlates with the three dimensional
structure of a ligand or the orientation in space of selected
atoms and bonds of the ligand. Thus, a ligand bound to a
polypeptide in a distinct conformation will have at least one
unique conformation-dependent property correlated with the bound
conformation of the ligand. A conformation-dependent property
can be derived from or include the entire ligand structure or
selected atoms and bonds, including a fragment or portion of the
complete atomic composition of the ligand. A
conformation-dependent property that includes selected atoms and
bonds of a ligand can include 2 or more, 3 or more, 5 or more,
10 or more, 15 or more, 20 or more, 25 or more, or 50 or more
atoms of a bound conformation of a ligand.

A characteristic that specifically correlates with a three
dimensional structure of a ligand is a characteristic that is
substantially different between at least two different bound
conformations of the same ligand and, therefore, distinguishes
the two different bound conformations. A conformation-dependent
property can include a physical or chemical characteristic of a
ligand, for example, absorption and emission of heat, absorption
and emission of electromagnetic radiation, rotation of polarized
light, magnetic moment, spin state of electrons, or polarity. A
conformation-dependent property can also include a structural
characteristic of a ligand based, for example, on an X-ray
diffraction pattern or a nuclear magnetic resonance (NMR)
spectrum. A conformation-dependent property can additionally
include a characteristic based on a structural model, for
example, an electron density map, atomic coordinates, or X-ray
structure. A conformation-dependent property can include a
characteristic spectroscopic signal based on, for example,
Raman, circular dichroism (CD), optical rotation, electron
paramagnetic resonance (EPR), infrared (IR), ultraviolet/visible
absorbance (UV/Vis), fluorescence, or luminescence
spectroscopies. A conformation-dependent property can also
include a characteristic NMR signal, for example, chemical
shift, J coupling, dipolar coupling, cross-correlation, nuclear
spin relaxation, transferred nuclear Overhauser effect, or
combinations thereof. A conformation-dependent property can
additionally include a thermodynamic or kinetic characteristic
based on, for example, calorimetric measurement or binding
affinity measurement. Furthermore, a conformation-dependent
property can include a characteristic based on electrical
measurement, for example, voltammetry or conductance.

The invention provides a method for separating two or more
subsets of polypeptides within a set of polypeptides. The method
includes the steps of: (a) determining a sequence comparison
signature for each amino acid sequence in a set of amino acid
sequences, wherein the sequence comparison signature includes
pairwise comparison scores for the amino acid sequence compared
to each of the other amino acid sequences in the set; (b)
constructing a distance arrangement including the sequence
comparison signatures related according to the distance between
each of the sequence comparison signatures; and (c) identifying
a first and second cluster of sequence comparison signatures in
the distance arrangement, wherein the first cluster includes
sequence comparison signatures for polypeptides having a similar
protein fold or biological function, the protein fold or
function being different compared to a protein fold or function
of polypeptides having sequence comparison signatures in the
second cluster.

In a particular embodiment, the invention provides a method for
identifying a polypeptide pharmacofamily. The method includes
the steps of: (a) determining a sequence comparison signature
for each amino acid sequence in a set of amino acid sequences,
wherein the sequence comparison signature includes pairwise
comparison scores for the amino acid sequence compared to each
of the other amino acid sequences in the set; (b) constructing a
distance arrangement including the sequence comparison
signatures related according to the distance between each of the
sequence comparison signatures; and (c) identifying separate
clusters of sequence comparison signatures in the distance
arrangement, wherein the separate clusters include sequence
comparison signatures for sequences in the same ligand binding
family and separate pharmacofamilies.

A set of amino acid sequences from which subsets of sequences
can be identified in the methods can include polypeptides or
proteins representing a wide range of structural or functional
characteristics. A set of amino acid sequences need not have any
particular predefined or known common characteristics. Such
sequence sets include those found in genome or proteome
databases from a particular organism or across a variety of
organisms. Organism specific databases that can be used in the
methods include FlyBase, which contains sequences for Drosophila
melanogaster (The FlyBase consortium, Nucl. Acids. Res. 27:85-88
(1999)), the TB proteome (Cole et al., Nature 393:537-544
(1998)), or human genome (Venter et al., Science 291:1304-1351
(2001), Lander et al., Nature 409:860-921 (2001)). Examples of
databases that include sequences from a wide diversity of
organisms are Swiss-Prot (Bairoch et al., Nucl. Acids. Res.
28:45-48 (2000)), Protein Data Bank (PDB, operated by the
Research Collaboratory for Structural Bioinformatics, see Berman
et al., Nucleic Acids Research, 28:235-242 (2000)), Protein
Information Resource (PIR; McGarvey et al., Bioinformatics
16:290-291 (2000)), PRF and TrEMBL (Bairoch et al., Nucl. Acids.
Res. 28:45-48 (2000)).



The methods can also be used with a set of amino acid sequences
that are preselected for a particular structural or functional
characteristic. A preselected range of structural or functional
characteristics for a set of polypeptides used in the methods
can include, for example, binding to a particular ligand,
interacting with a particular biological component such as
another protein, common enzymatic function, common structural
motifs or folds, common subcellular localization or
co-expression due to a particular stimulus or developmental or
growth stage. Those skilled in the art will be able to preselect
a set of amino acid sequences based on that which is known for
particular sequences as provided in the scientific literature or
in annotations of particular databases. Examples of sets of
polypeptides from which subsets can be identified in the methods
of the invention include, for example, kinases, G-protein
coupled receptors, nuclear factors, proteases, dehydrogenases,
phosphatases, transcription factors, nucleotide binding enzymes
or membrane proteins.

Polypeptides of a set can be preselected for their ability to
specifically bind to the same ligand, or portion thereof. Use of
the methods with a set of amino acid sequences that is
preselected for the ability to bind a particular ligand is
demonstrated in Example II. Specific binding between a
polypeptide and a ligand can be identified by methods known in
the art. Methods of determining specific binding include, for
example, equilibrium binding analysis, competition assays, and
kinetic assays as described in Segel, Enzyme Kinetics, John
Wiley and Sons, New York (1975), and Kyte, Mechanism in Protein
Chemistry, Garland Pub. (1995). Thermodynamic and kinetic
constants can be used to identify and compare polypeptides and
ligands that specifically bind each other and include, for
example, dissociation constant (Kd), association constant (Ka),
Michaelis constant (Km), inhibitor dissociation constant (Kis),
association rate constant (kon) or dissociation rate constant
(koff). For example, a family can be identified as having
members that can specifically bind a ligand with a Kd of at most
10^-3 M, 10^-4 M, 10^-5 M, 10^-6 M, 10^-7 M, 10^-8 M, 10^-9 M,
10^-10 M, 10^-11 M, or 10^-12 M or lower.
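
The relationship between the rate constants and the equilibrium constants listed above can be made concrete with a short Python sketch. It assumes simple reversible 1:1 binding, for which Kd = koff/kon and Ka = 1/Kd; the function names, the example rate constants and the threshold value are illustrative assumptions rather than values from this document.

```python
def dissociation_constant(k_off: float, k_on: float) -> float:
    """Kd in M for 1:1 binding, from koff (1/s) and kon (1/(M*s))."""
    return k_off / k_on


def association_constant(kd: float) -> float:
    """Ka in 1/M is the reciprocal of Kd."""
    return 1.0 / kd


def binds_at_most(kd: float, threshold: float) -> bool:
    """True if binding is at least as tight as the threshold Kd."""
    return kd <= threshold


kd = dissociation_constant(k_off=1e-3, k_on=1e6)  # about 1e-9 M
is_member = binds_at_most(kd, threshold=1e-6)      # hypothetical family cutoff
```

A polypeptide whose measured Kd falls at or below the chosen cutoff would be retained in the preselected set.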

The use of a preselected set of amino acid 
sequences provides the advantage of narrowing the number 
of sequences to be compared thereby reducing 
computational demands for the methods. In addition, 
preselection can narrow the structural and functional 
diversity represented in the identified subsets to focus 
on desired characteristics. For example, a family of 
polypeptides known to bind a particular ligand can be 
used as a set in the methods thereby focusing the 
comparison on characteristics of ligand binding including 
the bound conformation of the ligand or the structure of 
the ligand binding site. 

A set of amino acid sequences used in the methods can be
translated from one or more nucleic acid sequences in a nucleic
acid sequence database. Accordingly, the methods can include a
step of translating the coding regions of a nucleic acid
sequence into amino acid sequences. A coding region of a nucleic
acid sequence can be translated according to the appropriate
genetic code for the organism from which the nucleic acid
sequence is derived. The coding region can be a predetermined
portion of the sequence or, in the case where exons and introns
are present, a predetermined set of spliced portions identified,
for example, from annotations of the nucleic acid in the
database. Alternatively, the coding region can be predicted or
determined based on methods known in the art for predicting gene
structure or coding sequence location. Computational methods for
predicting the coding region of a nucleic acid sequence are
known in the art as described in Pevzner, Computational
Molecular Biology, an Algorithmic Approach, The MIT Press,
Cambridge MA (2000) and include, for example, statistical
approaches based on codon usage or in-frame hexamer count,
similarity based approaches, spliced alignment approaches and
Hidden Markov based approaches such as GENSCAN.
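
The translation step described above can be sketched in Python as follows. The sketch assumes the standard genetic code, a coding region that starts in frame, and exon coordinates supplied as hypothetical 0-based, half-open (start, end) pairs such as might be read from database annotations; an organism with a different genetic code would need a different codon table.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(seq: str, exons: List[Tuple[int, int]]) -> str:
    """Join annotated exon segments (0-based, half-open) into one coding region."""
    return "".join(seq[start:end] for start, end in exons)


def translate(cds: str) -> str:
    """Translate an in-frame coding region, stopping at the first stop codon.
    Codons that cannot be resolved are reported as 'X'."""
    cds = cds.upper().replace("U", "T")  # accept RNA or lower-case input
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical genomic fragment with one intron between two exons.
genomic = "ccATGAAggtttAGTGTAAcc"
peptide = translate(splice(genomic, [(2, 7), (12, 19)]))
```

The spliced coding region here is ATGAAAGTGTAA, which translates to the three-residue peptide MKV before the stop codon.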

A nucleic acid sequence database from which a set of amino acid
sequences is translated can contain a variety of types of
nucleic acids including, for example, genomic DNA sequences,
cDNA sequences or mRNA sequences or combinations thereof. An
example of a database including a variety of types of nucleic
acid sequences is GenBank. Other nucleic acid sequence databases
useful in the methods include a genome database such as any of
those described above.

A set of amino acid sequences used in the 
methods can include full protein sequences or fragments 
thereof. One or more amino acid sequence fragments 
present in a set of sequences used in the methods can 
correlate with particular exons or domains of a protein. 
An amino acid sequence fragment can also be translated 
from an Expressed Sequence Tag (EST). Thus the methods 
can be used to identify, classify or characterize 
proteins based on sequence fragments. For example, 
identification of a subset of polypeptides to which a 
translated EST sequence belongs can be used to predict 
the structure or function of the polypeptide encoded by 
the EST. Similarly, the methods can be used to identify, 
classify or characterize portions of proteins such as 
domains or exon encoded regions based on similar 
structure or function independent of the characteristics 
of other regions of the protein from which the fragment 
is derived. 



Amino acid sequences in a set are compared in 
the methods on the basis of the sequence comparison 
signatures for each sequence. The sequence comparison 
signature can be any representation of the degree of 
similarity, likeness or identity for a particular amino 
acid sequence compared to the other amino acid sequences 
in the set. Such representations can include similarity 
scores calculated using any search algorithm or method of 
pairwise sequence comparison known by those skilled in 
the art such as those described below. 

The dynamic programming algorithm is a 
mathematically rigorous method of pairwise sequence 
comparison and can be used according to several variants 
including, for example, Needleman-Wunsch (Needleman and 
Wunsch, J. Mol. Biol. 48:443-453 (1970)), Sellers 
(Sellers, J. Appl. Math. 26:787-793 (1974)), quasi-global 
alignment (Sellers, Proc. Natl. Acad. Sci. USA 76:3041- 
3041 (1979)) and Smith-Waterman (Smith and Waterman, J. 
Mol. Biol. 147:195-197 (1981) and Waterman and Eggert, J. 
Mol. Biol. 197:723-728 (1987)). Because the dynamic 
programming algorithm is rigorous, it is well suited for 
finding optimum alignments and sequence comparison scores 
for a set of amino acid sequences. Being rigorous, the 
dynamic programming algorithm is also computationally 
demanding. In applications of the methods in which large 
sequence sets are used or less rigorous comparison is 
required, a heuristic search algorithm can be used. 
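
As a non-limiting sketch of the dynamic programming approach, 
the following computes a Smith-Waterman local alignment score. 
The match, mismatch and gap values are illustrative 
assumptions; in practice a similarity matrix such as PAM or 
BLOSUM and other gap penalties could be substituted. 

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumption: a simple match/mismatch score and a linear gap penalty
    stand in for a full similarity matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The table has one cell per pair of positions, so the work 
grows with the product of the two sequence lengths, which is 
why the rigorous comparison becomes demanding for large sets. 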

Heuristic algorithms that can be used in the 
methods of the invention include, for example, BLAST and 
FASTA. BLAST, Basic Local Alignment Search Tool, uses a 
heuristic algorithm that reduces the computational 
requirements of the Smith-Waterman algorithm by seeking 
local alignments prior to comparing sequences in a 
restricted version of the Smith-Waterman algorithm. 
BLAST is therefore able to detect relationships among 
sequences including those which share only isolated 
regions of similarity including, for example, protein 
domains (Altschul et al., J. Mol. Biol. 215:403-410 
(1990)). BLAST divides sequences into a list of 
overlapping words and extends the list to include all 
words that score above a predefined matrix-defined 
threshold. This threshold limits the number of matches 
that will be passed from the heuristic screening step to 
the comparison step. Those skilled in the art can use 
BLAST according to default parameters as described by 
Tatusova and Madden, FEMS Microbiol. Lett. 174:247-250 
(1999) or on the National Center for Biotechnology 
Information web page at ncbi.nlm.gov/BLAST/. Alternatively, 
parameters such as the length of the words, value of the 
predefined matrix-defined threshold or type of similarity 
matrix utilized can be adjusted to suit a particular 
application of the methods of the invention. 
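
The word list and threshold described above can be 
illustrated with the following sketch. It is not the BLAST 
implementation: the scoring function toy_score is a 
hypothetical stand-in for a similarity matrix, and the word 
length and threshold values are assumptions chosen only to 
keep the example small. 

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids


def toy_score(x, y):
    """Hypothetical stand-in for a similarity matrix entry."""
    return 4 if x == y else -1


def overlapping_words(seq, w=3):
    """Divide a sequence into its overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """List every word scoring at or above threshold against word."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits
```

Raising the threshold shrinks the neighborhood and therefore 
the number of matches passed on to the comparison step; 
lowering it has the opposite effect. 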

In addition to the originally described BLAST 
(Altschul et al., supra, 1990), modifications to the 
algorithm have been made (Altschul et al., Nucleic Acids 
Res. 25:3389-3402 (1997)). One modification is Gapped 
BLAST, which allows gaps, either insertions or deletions, 
to be introduced into alignments. Allowing gaps in 
alignments tends to reflect biologic relationships more 
closely. For example, Gapped BLAST can be used to 
identify sequence identity within similar domains of two 
or more proteins. A second modification is PSI-BLAST, 
which is a sensitive way to search for sequence homologs. 
PSI-BLAST performs an initial Gapped BLAST search and 
uses information from any significant alignments to 
construct a position-specific score matrix, which 
replaces the query sequence for the next round of 
database searching. A PSI-BLAST search is often more 
sensitive to weak but biologically relevant sequence 
similarities. 

FASTA uses a word search algorithm as a 
heuristic screen prior to performing a restricted Smith- 
Waterman alignment (Pearson and Lippman, Proc. Natl. 
Acad. Sci. USA 85:2444-2448 (1988)). In the word search 
both the query and library sequences are divided into 
overlapping words of specified length. The lists of 
words for the query and library sequences are compared in 
a matrix and the diagonal with the most matching words is 
taken as the region most likely to contain the best 
alignment. The results from the word search are used to 
identify sequences with sufficient similarity to use in 
the subsequent alignment step. Those skilled in the art 
can use default parameters or adjust parameters such as 
word size, window size for defining the length of 
insertions or deletions one sequence can accumulate 
relative to another, or the type of similarity matrix 
utilized. 
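
The diagonal screen described above can be sketched as 
follows. This is an illustrative simplification rather than 
the FASTA program: only exact word matches are counted, and 
the function name best_diagonal and the default word size 
are assumptions. 

```python
from collections import Counter, defaultdict


def best_diagonal(query, library, k=2):
    """Return (diagonal, count) for the diagonal with most word matches.

    A diagonal is the offset i - j between a word at query position i
    and the same word at library position j. Returns (None, 0) when the
    sequences share no word.
    """
    positions = defaultdict(list)  # word -> positions in the query
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    diagonals = Counter()
    for j in range(len(library) - k + 1):
        for i in positions.get(library[j:j + k], ()):
            diagonals[i - j] += 1

    if not diagonals:
        return None, 0
    return max(diagonals.items(), key=lambda item: item[1])
```

A library sequence whose best diagonal carries few matching 
words would be screened out before the more costly alignment 
step. 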



A similarity matrix used in a sequence 
comparison algorithm can be any that quantifies the 
probability that a particular substitution of one amino 
acid for another will preserve or disrupt the physical 
and chemical properties necessary to the structure or 
function of the polypeptide. Similarity matrices can be 
based on evolutionary models; structural properties such 
as Chou-Fasman propensities; chemical properties such as 
charge, shape, or polarity; and combinations thereof. 
Examples of similarity matrices known in the art and 
useful in the invention include the PAM matrix and BLOSUM 
matrix as described in Nicholas et al., Biotechniques 
28:1174-1191 (2000). In addition, the scale of the 
similarity matrix used in the comparison algorithm can be 
adjusted to suit the set of amino acid sequences being 
compared or the resulting range of percent identity. 
Examples of differently scaled matrices that can be used 
in the methods of the invention include PAM40, PAM120, 
PAM240, BLOSUM60, BLOSUM40 and BLOSUM30 as described, for 
example, in Nicholas et al., supra (2000). 

Once similarity scores have been determined for 
an amino acid sequence compared to the other sequences in 
a set, a sequence comparison signature containing these 
scores can be created. A sequence comparison signature 
of the invention can include any of a variety of known 
comparison scores including scores provided by the above- 
described algorithms such as E-scores, sequence identity 
scores or Bit-scores. These scores can be binned into a 
representation such as a string of values as described in 
Example I. 

The methods can include a step of converting 
sequence similarity scores by a uniform transformation. 
Any uniform transformation capable of converting the 
sequence similarity scores or the sequence comparison 
signatures in which they reside into a format amenable to 
comparison and clustering can be used in the methods of 
the invention. For example, a pairwise similarity score 
can be converted to a binary score indicating presence or 
absence of similarity between the two sequences being 
compared. Assignment of a binary score can be determined 
by a predefined percent identity cutoff where two 
sequences having a percent identity below the cutoff 
value are assigned a score of 0 indicating absence of 
similarity and two sequences having a percent identity 
above the cutoff are assigned a score of 1 indicating 
presence of similarity. Adjustment of the cutoff value 
can be used to alter the sensitivity and selectivity of 
the methods. In particular, as the cutoff is increased, 
sensitivity is reduced due to the reduction in the 
number of related sequences identified, and selectivity 
is increased due to the decrease in the number of 
unrelated sequences identified as being similar. 
Sequence similarity scores can also be binned according 
to particular ranges of identity or similarity as 
demonstrated in Example I where sequence similarity 
scores are binned into 10 groups. 
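
For illustration, the binary conversion and the binning 
described above might be sketched as follows. The cutoff of 
30 percent, the equal-width bins and the treatment of a 
score exactly at the cutoff are assumptions; the bin 
boundaries actually used in Example I are not reproduced 
here. 

```python
def binarize(percent_identity, cutoff=30.0):
    """Return 1 for presence of similarity, 0 for absence.

    Assumption: a score exactly equal to the cutoff is scored as 1.
    """
    return 1 if percent_identity >= cutoff else 0


def bin_score(percent_identity, n_bins=10):
    """Map a percent identity in [0, 100] to a bin index 0..n_bins-1.

    Assumption: bins are of equal width; 100 percent falls in the top bin.
    """
    width = 100.0 / n_bins
    return min(int(percent_identity // width), n_bins - 1)


def signature(percent_identities, n_bins=10):
    """Build a string signature from one sequence's pairwise scores.

    One character per bin is used, so n_bins should not exceed 10 here.
    """
    return "".join(str(bin_score(p, n_bins)) for p in percent_identities)
```

With these assumptions a sequence scoring 100, 35.5 and 7 
percent identity against three others would be represented 
by a three-character string, one character per comparison. 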

Conversion of sequence similarity scores with a 
uniform transformation can include a mathematical 
manipulation such as an inverse function (for example, 
1/score), an exponential function (for example, 
e^(-score)) or an inverse of an exponential (for example, 
1/10^score). Another conversion useful in the methods is 
a hashing algorithm which can be used to generate a 
hashkey from the sequence comparison scores. A hashkey is 
a compact numerical representation used to solve indexing 
problems as described in Pieprzyk and Sadeghiyan, "Design 
of Hashing Algorithms," Lecture Notes in Computer Science, 
Vol. 756, Springer-Verlag (1993). A hashkey can be used 
to assign a memory address to a sequence similarity score 
or its vector, thereby reducing computation time required 
for clustering. 
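
The conversions named above can be sketched as follows, 
purely by way of example. The function names are 
hypothetical, and the use of a Python dictionary, which 
hashes each key internally to locate its storage slot, is 
an assumption standing in for whichever hashing algorithm 
is chosen. 

```python
import math
from collections import defaultdict


def inverse(score):
    """Inverse function, 1/score (score must be nonzero)."""
    return 1.0 / score


def negative_exponential(score):
    """Exponential function, e raised to the power -score."""
    return math.exp(-score)


def inverse_power_of_ten(score):
    """Inverse of an exponential, 1 / 10**score."""
    return 1.0 / (10.0 ** score)


def bucket_by_signature(signatures):
    """Group sequence names that have identical signatures.

    signatures maps a name to a tuple of binned scores (a hypothetical
    layout). The dictionary hashes each tuple to find its bucket, so
    identical signatures are located without pairwise comparison.
    """
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        buckets[sig].append(name)
    return dict(buckets)
```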

A set of sequence similarity signatures that 
have been determined for a set of proteins can be related 
to each other according to the distance separating each 
sequence similarity signature from the others. A 
convenient representation for relating sequence 
similarity signatures is points in space. A sequence 
similarity signature can be represented in high 
dimensional space as a vector, where each pairwise 
distance value, or converted value thereof, supplies one 
coordinate of the point in that space. Proximity of the 
points in this space indicates similarity, whereas points 
that are distal are dissimilar. 

The distance between two similarity 
signatures that are represented as a first and second 
vector in high dimensional space can be determined based 
on the distances separating the points of the first 
vector from the points of the second vector. A variety 
of distance measures known in the art can be used in 
the methods of the invention including, for example, 
Euclidean distance. Euclidean distance is the square 
root of the sum of the squared differences between 
corresponding elements of the two compared vectors. Another 
distance is the Mahalanobis distance, which scales the 
difference in each coordinate by the inverse of the 
variance in that dimension as described, for example, in 
Mahalanobis, Proc. Natl. Acad. Sci. USA 12:49-55 (1936). 
The cosine of the angle between the two vectors can also 
be computed and used as a distance metric. Hamming 
distance between two vectors is also useful in the 
methods of the invention and is given by the count of 
the number of elements in which the two vectors differ. 
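
A minimal sketch of these distance measures follows. The 
function names are illustrative, and the variance-scaled 
form shown corresponds to the per-dimension description 
given above, which is the special case of the Mahalanobis 
distance in which the dimensions are treated as 
uncorrelated. 

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def variance_scaled(a, b, variances):
    """Each squared difference is divided by that dimension's variance.

    Assumption: dimensions are uncorrelated, so only per-dimension
    variances (all nonzero) are needed.
    """
    return math.sqrt(
        sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))
    )


def cosine(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a, b):
    """Count of positions at which the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

Note that a larger cosine indicates greater similarity, so 
a value such as 1 minus the cosine could be used where a 
quantity that grows with dissimilarity is wanted. 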

Distances that are particularly useful when 
binary sequence comparison scores are used include, for 
example, the exclusive OR which is a reduction of a 
Hamming distance to a binary case, again being a count of 
the number of elements differing between the two vectors 
that are compared. The Tanimoto coefficient is the ratio 
of bits set (where a bit set is a bit that is equal to 1) 
for both vectors to the total number of bits set in 
either vector. A generalization of the Tanimoto 
coefficient is the Tversky Similarity, where both vectors 
can be given different weighting as described in Sneath 
and Sokal, Numerical Taxonomy, WH Freeman, San Francisco 
(1973). Those skilled in the art will recognize that 
this is only a partial list of the methods known in the 
art for measuring distance between vectors and will be 
able to use other known methods for measuring distance 
between vectors in the methods to determine the distance 
between sequence similarity signatures according to the 
teaching herein. 
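
The binary measures above can be illustrated as follows. 
The vectors are assumed to be equal-length lists of 0 and 1 
values; the weights alpha and beta and the value returned 
when no bit is set in either vector are assumptions made 
for the sketch. 

```python
def xor_distance(a, b):
    """Count of positions where exactly one of the two bits is set."""
    return sum(x ^ y for x, y in zip(a, b))


def tanimoto(a, b):
    """Bits set in both vectors divided by bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0  # assumed value for empty vectors


def tversky(a, b, alpha=1.0, beta=1.0):
    """Tanimoto generalized with separate weights for each vector.

    alpha weights bits set only in a, beta weights bits set only in b.
    With alpha = beta = 1 this reduces to the Tanimoto coefficient.
    """
    both = sum(x & y for x, y in zip(a, b))
    only_a = sum(x & (1 - y) for x, y in zip(a, b))
    only_b = sum((1 - x) & y for x, y in zip(a, b))
    denominator = both + alpha * only_a + beta * only_b
    return both / denominator if denominator else 1.0
```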



In addition to the distance arrangements set 
forth above, a variety of other formats that are 
convenient for comparing distances can also be used 
including, for example, a matrix as described in Example 
I or tree structure as described in Durbin et al., supra 
(1998). 

Once a distance arrangement has been created, 
sequence comparison signatures within predefined 
distances can be grouped using a clustering algorithm 
including, for example, a hierarchical clustering 
algorithm such as an agglomerative or divisive 
hierarchical clustering algorithm (see, for example, 
Kaufman and Rousseeuw, Finding Groups in Data: An 
Introduction to Cluster Analysis, John Wiley and Sons, New 
York (1990)). A non-hierarchical clustering algorithm 
can also be used such as the Jarvis-Patrick algorithm 
(Jarvis and Patrick, IEEE Trans. Comput. C-22:1025-1034 
(1973)). The Jarvis-Patrick algorithm clusters sequence 
similarity signatures according to the number of nearest 
neighbors they share. Although the determination of which 
points are neighbors is dependent upon the distance between 
neighbors, clustering by the Jarvis-Patrick algorithm is 
not based solely on distance. Clustering can also be 
achieved with a cell-based clustering algorithm (see, for 
example, Schnur, J. Chem. Inf. Comput. Sci. 39:36-43 
(1999)). Cell-based clustering divides the space 
containing sequence similarity signatures into areas or 
volumes and clusters those that fall into the same 
volume. The cell-based method is not based solely on 
distance since points that are separated by a cell 
division, although quite proximal, can be separated into 
different clusters. Clustering in cell-based methods is 
dependent upon the size and shape of the cells, which can 
be adjusted to alter the number of clusters identified or 
range of similarity of sequences in each cluster to suit 
a particular application of the method. 
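
The two non-hierarchical approaches above can be sketched as 
follows. This is one common formulation of Jarvis-Patrick 
clustering, given as an assumption rather than as the 
procedure of the cited reference: two points are joined 
when each is among the other's k nearest neighbors and they 
share at least k_min neighbors. The cubic cells of the 
second function are likewise an assumed cell shape. 

```python
import math


def jarvis_patrick(points, k=2, k_min=1):
    """Return one cluster label per point (points are coordinate tuples)."""
    n = len(points)
    neighbors = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbors.append(set(order[:k]))  # the k nearest neighbors of i

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            shared = len(neighbors[i] & neighbors[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def cell_clusters(points, cell_size):
    """Group point indices by the cubic cell that contains each point."""
    cells = {}
    for index, point in enumerate(points):
        key = tuple(math.floor(c / cell_size) for c in point)
        cells.setdefault(key, []).append(index)
    return cells
```

In the cell-based sketch, points at 0.9 and 1.1 fall into 
different cells when the cell size is 1.0 even though they 
are close together, which illustrates why that method is 
not based solely on distance. 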

Clusters that have been created based on 
sequence scores can be evaluated to identify subsets of 
polypeptides having one or more common structural or 
functional characteristics. Such an evaluation can be 
used to confirm membership of a polypeptide in a 
particular subset or to determine membership for a 
polypeptide that is apparently similar to clusters for 
more than one subset. The structural and functional 
similarities can be any that are encoded by the amino 
acid sequences of the polypeptides identified. 
Structural similarities of subsets identified by the 
methods can include, for example, similar protein folds 
such as those present in particular SCOP families (Murzin 
et al., J. Mol. Biol. 247:536-540 (1995)). The subgroups 
identified by the methods can have similar overall 
protein fold or regions of similar fold such as domains, 
active sites or binding sites. 

Protein fold refers to the specific geometric 
arrangement and connectivity of a combination of 
secondary structure elements in a polypeptide structure. 
Secondary structure elements of a polypeptide that can be 
arranged into a fold including, for example, alpha 
helices, beta sheets, turns and loops are well known in 
the art. Folds of a polypeptide can be recognized by one 
skilled in the art and are described in, for example, 
Branden and Tooze, Introduction to Protein Structure, 
Garland Publishing, New York (1991) and Richardson, Adv. 
Prot. Chem. 34:167-339 (1981). An example of a ligand- 
binding family of polypeptides having members with 
different folds is the NAD(P) binding polypeptides within 
which the folds include, for example, the NAD(P)(H) 
binding Rossmann fold, heme-linked catalase fold, β-α TIM 
barrel fold, dihydrofolate reductase fold, FAD/NAD(P)(H) 
binding domain fold and the ferredoxin-like fold as 
described in U.S. Patent Application Number 09/753,020, 
which is hereby incorporated by reference. 

The methods can be used to identify polypeptide 
subsets containing members that share one or more 
characteristics other than common three-dimensional 
structure, protein fold or SCOP family membership. In 
particular, some polypeptides are known to have similar 
protein fold or to be classified into the same SCOP 
family but to have different functions. An advantage of 
the invention is that subsets can be identified based on 
similarities in function that are not immediately 
apparent from structural similarity or even pairwise 
sequence comparison. In addition, the methods can be 
used to identify a member of a subset of sequences based 
on one or more common characteristics other than three- 
dimensional structure, protein fold or SCOP family 
membership, the common characteristics including, for 
example, functional similarities. Functional 
similarities of subsets of polypeptides identified by the 
methods can include, for example, binding to a common 
ligand, similar enzymatic activity or similar subcellular 
localization. Use of the methods to identify subsets of 
polypeptides having similar enzyme function is 
demonstrated in Example II where subsets within the 
family of NAD(P)(H) binding polypeptides are identified 
including, for example, the dehydrogenases, reductases, 
isomerases, oxidases, catalases, synthases, cyclases, 
transferases, glucosidases and galactosidases listed in 
Table 1. Pharmacofamilies containing members that bind 
to substantially the same bound conformation of a ligand 
can also be identified by evaluating clustered 
polypeptides based on the structures of bound ligands. 
Use of the methods to identify pharmacofamilies that bind 
to a common pharmacophore is also demonstrated in Example 
II. 

The methods of the invention can be used to 
identify any number of pharmacofamilies in a family up to 
and including the number of different bound conformations 
of a ligand that can be distinguished in the family. In 
cases where two or more polypeptide pharmacofamilies 
reside in a polypeptide family, clusters containing 
different pharmacofamilies can be distinguished according 
to differences in bound conformations of a ligand bound 
to the polypeptides. In this case, a bound conformation 
of a ligand can be determined and compared according to 
the methods described below. Polypeptides bound to 
different bound conformations of a ligand can be 
identified as those that do not show substantial overlap 
of all corresponding atoms when bound conformations are 
overlaid. Thus, polypeptides that bind different bound 
conformations of a ligand can be separated into different 
pharmacofamilies. Pharmacofamilies in turn can be 
identified as containing polypeptides that bind 
substantially the same bound conformation of a ligand. 

A bound conformation of a ligand bound to a 
polypeptide can be determined from a previously observed 
molecular structure or from data specifying a molecular 
structure for a bound conformation of a ligand. 
Previously observed structures can be acquired for use in 
the invention by searching a database of existing 
structures. An example of a database that includes 
structures of bound conformations of ligands bound to 
polypeptides is the Protein Data Bank (PDB, operated by 
the Research Collaboratory for Structural Bioinformatics, 
see Berman et al., Nucleic Acids Research 28:235-242 
(2000)). A database can be searched, for example, by 
querying based on chemical property information or on 
structural information. In the latter approach, an 
algorithm based on finding a match to a template can be 
used as described, for example, in Martin, "Database 
Searching in Drug Design," J. Med. Chem. 35:2145-2154 
(1992). 

A bound conformation of a ligand bound to a 
polypeptide can be determined from an empirical 
measurement, or from a database. Data specifying a 
structure can be acquired using any method available in 
the art for structural determination of a ligand bound to 
a polypeptide. For example, X-ray crystallography can be 
performed with a crystallized complex of a polypeptide 
and ligand to determine a bound conformation of the 
ligand bound to the polypeptide. Methods for obtaining 
such crystal complexes and determining structures from 
them are well known in the art as described, for example, 
in McRee et al., Practical Protein Crystallography, 
Academic Press, San Diego (1993); Stout and Jensen, X-ray 
Structure Determination: A Practical Guide, 2nd Ed., Wiley, 
New York (1989); and McPherson, The Preparation and 
Analysis of Protein Crystals, Wiley, New York (1982). 
Another method useful for determining a bound 
conformation of a ligand bound to a polypeptide is 
Nuclear Magnetic Resonance (NMR). NMR methods are well 
known in the art and include those described, for example, 
in Reid, Protein NMR Techniques, Humana Press, Totowa NJ 
(1997); and Cavanagh et al., Protein NMR Spectroscopy: 
Principles and Practice, ch. 7, Academic Press, San Diego 
CA (1996). 

A bound conformation of a ligand can also be 
determined from a hypothetical model. For example, a 
hypothetical model of a bound conformation of a ligand 
can be produced using an algorithm which docks a ligand 
to a polypeptide of known structure and fits the ligand 
to the polypeptide binding site. Algorithms available in 
the art for fitting a ligand structure to a polypeptide 
binding site include, for example, DOCK (Kuntz et al., 
J. Mol. Biol. 161:269-288 (1982)) and INSIGHT98 (Molecular 
Simulations Inc., San Diego, CA). 

Common structural properties can be identified 
by comparing the three dimensional structures of two or 
more polypeptides or a bound ligand using methods known 
in the art including, for example, cluster analysis of 
structures, visual inspection and pairwise structural 
comparisons. Cluster analysis of structures is commonly 
performed by, but not limited to, partitioning methods or 
hierarchical methods as described, for example, in 
Kaufman and Rousseeuw, Finding Groups in Data: An 
Introduction to Cluster Analysis, John Wiley and Sons 
Inc., New York (1990). Partitioning methods that can be 
used include, for example, partitioning around medoids, 
clustering large applications, and fuzzy analysis, as 
described in Kaufman and Rousseeuw, supra. Hierarchical 
methods useful in the invention include, for example, 
agglomerative nesting, divisive analysis, and monothetic 
analysis, as described in Kaufman and Rousseeuw, supra. 
Algorithms for cluster analysis of molecular structures 
are known in the art and include, for example, COMPARE 
(Chiron Corp, 1995; distributed by Quantum Chemistry 
Program Exchange, Indianapolis IN). COMPARE can be used 
to make all possible pairwise comparisons between a set 
of conformations of polypeptides or bound ligands or 
portions thereof. COMPARE reads PDB files and uses a 
Ferro-Hermanns ORIENT algorithm for a least squares root 
mean square (RMS) fit. The structures can be clustered 
into groups using the Jarvis-Patrick nearest neighbors 
algorithm. Based on the RMS deviation between 
polypeptide structures or bound conformations of a 
ligand, or portions thereof, a list of 'nearest 
neighbors' for each structure is generated. Two 
structures are then grouped together or clustered if (1) 
the RMS deviation is sufficiently small and (2) both 
structures share a determined number of common 
'neighbors'. Both criteria are adjusted by the program 
to generate clusters based on a user defined cutoff for 
distance between individual clusters. Follow up analysis 
can be conducted using Insight II to verify structural 
clusters. Thus, two or more polypeptides can be 
confirmed as being in the same cluster or a polypeptide 
can be assigned to one of two or more proximal clusters 
based on common cluster assignment evaluated by both 
sequence-based clustering and structure-based clustering. 

Structural similarity can also be identified by 
overlaying two or more structures to determine a degree 
of overlap. For example, two structures can be compared 
based on the proximity of centroid position for each atom 
using known algorithms such as the OVERLAY routine in 
INSIGHT98 (Molecular Simulations Inc., San Diego CA). 
The degree of overlap can be determined based on root 
mean square deviation as described below. Two or more 
structures that show substantial similarity in structural 
overlap can be used to produce an average structure. The 
averaged structure can, in turn, be used as a template 
for comparing a polypeptide structure or bound 
conformation of a ligand to determine membership in a 
subset or pharmacofamily. Methods for comparing bound 
conformations of a ligand and producing an average 
structure are described in U.S. Patent Application Number 
09/753,020, which is hereby incorporated by reference. 

Using methods such as those described above, 
one skilled in the art will know how to identify 
structures that are substantially the same. For example, 
similarity can be evaluated according to the goodness of 
fit between two or more three-dimensional models of a 
polypeptide or bound ligand, or fragments thereof. 
Goodness of fit can be represented by a variety of 
parameters known in the art including, for example, the 
root mean square deviation (RMSD). A lower RMSD between 
structures correlates with a better fit compared to a 
higher RMSD between structures (see, for example, Doucet 
and Weber, Computer-Aided Molecular Design: Theory and 
Applications, Academic Press, San Diego, CA (1996)). 
Polypeptides having substantially the same structures can 
be identified by comparing mean RMSD values for the 
backbones of the polypeptides. Polypeptides, or fragments 
thereof, having substantially the same structures can 
have a mean backbone RMSD compared to each other that is 
less than about 5 Å or less than about 3 Å. Those 
skilled in the art will know that despite a high RMSD 
between overall structures indicating overall structural 
differences, two polypeptides can contain domains or 
other regions that are similar. Thus, a model used in 
comparing polypeptide structures can be that of the 
backbone structure of a domain or other region of the 
polypeptide. Bound conformations of a ligand having 
substantially the same structures can have a mean RMSD 
compared to each other that is less than about 1.1 Å. 
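
As an illustrative sketch, the RMSD between two sets of 
corresponding atomic coordinates can be computed as shown 
below. The sketch assumes that the two structures have 
already been overlaid and that atoms are listed in the same 
order; the function names and the default cutoff argument 
are hypothetical conveniences. 

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired (x, y, z) coordinates.

    Assumption: the structures are already superimposed and the atoms
    correspond one to one in list order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def substantially_same(coords_a, coords_b, cutoff=1.1):
    """True when the RMSD, in Angstroms, is below the chosen cutoff."""
    return rmsd(coords_a, coords_b) < cutoff
```

The default cutoff of 1.1 corresponds to the value given 
above for bound conformations of a ligand; a larger value 
such as 3 or 5 would be passed when comparing polypeptide 
backbones. 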

The subset or pharmacofamily to which an 
apparently clustered polypeptide belongs can also be 
identified by comparing the RMSD for its structure or for 
the bound conformation of its ligand to the structures of 
members in multiple clusters. Using this value for 
comparison, a member polypeptide is identified as having 
a smaller RMSD when compared to the coordinates of one or 
more structures within its subset or pharmacofamily than 
when compared to the coordinates of one or more 
structures in another subset or pharmacofamily. In 
addition, a member of a subset or pharmacofamily can be 
identified as having an RMSD compared to one or more 
polypeptide or ligand structures of the members in the 
subset or pharmacofamily that is smaller than the RMSD 
between the average coordinates of the polypeptide or 
ligand structures in each cluster. 

In addition, bound conformations of a ligand 
can be compared with respect to dihedral angles at 
particular bonds. Comparison between dihedral angles can 
be used, for example, in combination with overall RMSD 
comparisons such as those described above. Therefore, 
bound conformations that are not easily distinguished by 
comparison of overall RMSD alone can be distinguished 
according to the combined comparison of RMSD and dihedral 
angle. Bound conformations of a ligand that are bound to 
members of different pharmacofamilies can have dihedral 
angles that differ, for example, by at least about 10 
degrees, 30 degrees, 45 degrees, 90 degrees or 180 
degrees. 
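
By way of example, a dihedral angle can be computed from the 
coordinates of the four atoms that define it, and two such 
angles can be compared, as sketched below. The helper names 
are hypothetical, the four atoms are assumed not to be 
collinear, and the sign convention of the returned angle is 
an assumption, since none is specified above. 

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)
```

Wrapping the difference in this way means that angles of 
170 degrees and -170 degrees are treated as differing by 20 
degrees rather than 340 degrees. 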

A molecular structure can be conveniently 
stored and manipulated using structural coordinates. 
Structural coordinates can occur in any format known in 
the art so long as the format can provide an accurate 
reproduction of the observed structure. For example, 
crystal coordinates can occur in a variety of file types 
including, for example, .fin, .df, .phs, or .pdb as 
described, for example, in McRee, supra. One skilled in 
the art will recognize that structural coordinates can be 
derived from any method known in the art to determine the 
structure of a polypeptide or bound ligand including, for 
example, X-ray crystallographic analysis or NMR 
spectroscopy. 

Structures at atomic level resolution can be 
useful in the methods of the invention. Resolution, when 
used to describe molecular structures, refers to the 
minimum distance that can be resolved in the observed 
structure. Thus, resolution where individual atoms can 
be resolved is referred to in the art as atomic 
resolution. Resolution is commonly reported as a 
numerical value in units of Angstroms (Å, 10^-10 meter) 
correlated with the minimum distance which can be 
resolved such that smaller values indicate higher 
resolution. Structural models useful in the methods of 
the invention can have a resolution better than about 10 
Å, 5 Å, 3 Å, 2.5 Å, 2.0 Å, 1.5 Å, 1.0 Å, 0.8 Å, 0.6 Å, 
0.4 Å, or about 0.2 Å or better. Resolution can also be 
reported as an all atom RMSD as used, for example, in 
reporting NMR data. Bound conformations of a ligand 
useful in the methods of the invention can have an all 
atom RMSD better than about 10 Å, 5 Å, 3 Å, 2.5 Å, 2.0 Å, 
1.5 Å, 1.0 Å, 0.8 Å, 0.6 Å, 0.4 Å, or about 0.2 Å or 
better. 

Any representation that correlates with the 
structure of a molecule can be used in the methods of the 
invention. For example, a convenient and commonly used 
representation is a displayed image of the structure. 
Displayed images that are particularly useful for 
determining the structure of a polypeptide or a bound 
conformation of a ligand include, for example, ball and 
stick models, density maps, space filling models, surface 
maps, Connolly surfaces, Van der Waals surfaces or CPK 
models. Display of images as a computer output, for 
example, on a video screen can be advantageous for the 
structural overlay and clustering methods described 
herein. 

The invention can be used with any ligand that 
binds to two or more different polypeptides having 
different sequences including, for example, chemical or 
biological molecules such as simple or complex organic 
molecules, metal-containing compounds, carbohydrates, 
peptides, peptidomimetics, lipids, nucleic 
acids, and the like. 

In one embodiment, the methods of the invention 
can be used with a ligand that is a nucleotide derivative 
including, for example, a nicotinamide adenine 
dinucleotide-related molecule. Nicotinamide adenine 
dinucleotide-related (NAD-related) molecules that can be 
used in the methods of the invention can be selected from 
the group consisting of oxidized nicotinamide adenine 
dinucleotide (NAD+), reduced nicotinamide adenine
dinucleotide (NADH), oxidized nicotinamide adenine
dinucleotide phosphate (NADP+), and reduced nicotinamide
adenine dinucleotide phosphate (NADPH). An NAD-related
molecule can also be a mimetic of the above-described
molecules.

In another embodiment, the methods of the
invention can be used with a ligand that is an adenosine
phosphate-related molecule. Adenosine phosphate-related
molecules can be selected from the group consisting of
adenosine triphosphate (ATP), adenosine diphosphate
(ADP), adenosine monophosphate (AMP), and cyclic
adenosine monophosphate (cAMP). An adenosine
phosphate-related molecule can also be a mimetic of the
above-described molecules. A mimetic of an adenosine
phosphate-related molecule that can be used in the
invention includes, for example, quercetin,
adenylylimidodiphosphate (AMP-PNP) or olomoucine.

A ligand useful in the methods of the invention
can be a cofactor, coenzyme or vitamin including, for
example, NAD, NADP, or ATP as described above. Other
examples include thiamine (vitamin B1), riboflavin
(vitamin B2), pyridoxamine (vitamin B6), cobalamin
(vitamin B12), pyrophosphate, flavin adenine dinucleotide
(FAD), flavin mononucleotide (FMN), pyridoxal phosphate,
coenzyme A, ascorbate (vitamin C), niacin, biotin, heme,
porphyrin, folate, tetrahydrofolate, nucleotides such as
guanosine triphosphate, cytidine triphosphate, thymidine
triphosphate, uridine triphosphate, retinol (vitamin A),
calciferol (vitamin D2), ubiquinone, ubiquitin,
α-tocopherol (vitamin E), farnesyl, geranylgeranyl, pterin,
pteridine or S-adenosyl methionine (SAM).

A polypeptide can be used as a ligand in the
invention. For example, a ligand can be a naturally
occurring polypeptide ligand such as a ubiquitin or
polypeptide hormone including, for example, insulin,
human growth hormone, thyrotropin releasing hormone,
adrenocorticotropic hormone, parathyroid hormone,
follicle stimulating hormone, thyroid stimulating
hormone, luteinizing hormone, human chorionic
gonadotropin, epidermal growth factor, nerve growth
factor and the like. In addition, a polypeptide ligand
can be a non-naturally occurring polypeptide that has
binding activity. Such polypeptide ligands can be
identified, for example, by screening a synthetic
polypeptide library such as a phage display library or
combinatorial polypeptide library as described below. A
polypeptide ligand can also contain amino acid analogs or
derivatives such as those described below. Methods of
isolation of a polypeptide ligand are well known in the
art and are described, for example, in Scopes, Protein
Purification: Principles and Practice, 3rd Ed.,
Springer-Verlag, New York (1994); Deutscher, Methods in
Enzymology, Vol 182, Academic Press, San Diego (1990);
and Coligan et al., Current Protocols in Protein Science,
John Wiley and Sons, Baltimore, MD (2000).

A nucleic acid can also be used as a ligand in
the invention. Examples of nucleic acid ligands useful
in the invention include DNA, such as genomic DNA or cDNA,
or RNA, such as mRNA, ribosomal RNA or tRNA. A nucleic
acid ligand can also be a synthetic oligonucleotide.
Such ligands can be identified by screening a random
oligonucleotide library for ligand binding activity, for
example, as described below. Nucleic acid ligands can
also be isolated from a natural source or produced in a
recombinant system using well known methods in the art
including, for example, those described in Sambrook et
al., Molecular Cloning: A Laboratory Manual, 2nd ed.,
Cold Spring Harbor Press, Plainview, New York (1989);
and Ausubel et al., Current Protocols in Molecular Biology
(Supplement 47), John Wiley & Sons, New York (1999).

A ligand used in the invention can be an amino
acid, amino acid analog or derivatized amino acid. An
amino acid ligand can be one of the 20 standard amino
acids or any other amino acid isolated from a natural
source. Amino acid analogs useful in the invention
include, for example, neurotransmitters such as
gamma-aminobutyric acid, serotonin, dopamine, or
norepinephrine or hormones such as thyroxine, epinephrine
or melatonin. A synthetic amino acid, or analog thereof,
can also be used in the invention. A synthetic amino
acid can include chemical modifications of an amino acid
such as alkylation, acylation, carbamylation, iodination,
or any modification that derivatizes the amino acid.
Such derivatized molecules include, for example, those
molecules in which free amino groups have been
derivatized to form amine hydrochlorides, p-toluene
sulfonyl groups, carbobenzoxy groups, t-butyloxycarbonyl
groups, chloroacetyl groups or formyl groups. Free
carboxyl groups can be derivatized to form salts, methyl
and ethyl esters or other types of esters or hydrazides.
Free hydroxyl groups can be derivatized to form O-acyl or
O-alkyl derivatives. The imidazole nitrogen of histidine
can be derivatized to form N-im-benzylhistidine.
Naturally occurring amino acid derivatives of the twenty
standard amino acids can also be included in a cluster of
bound conformations including, for example,
4-hydroxyproline, 5-hydroxylysine, 3-methylhistidine,
homoserine, ornithine or carboxyglutamate.

A lipid ligand can also be used in the
invention. Examples of lipid ligands include
triglycerides, phospholipids, glycolipids or steroids.
Steroids useful in the invention include, for example,
glucocorticoids, mineralocorticoids, androgens, estrogens
or progestins.

Another type of ligand that can be used in the
invention is a carbohydrate. A carbohydrate ligand can
be a monosaccharide such as glucose, fructose, ribose,
glyceraldehyde, or erythrose; a disaccharide such as
lactose, sucrose, or maltose; an oligosaccharide such as
those recognized by lectins such as agglutinin, peanut
lectin or phytohemagglutinin; or a polysaccharide such as
cellulose, chitin, or glycogen.

Once two or more subsets of polypeptides have
been identified in a particular set of amino acid
sequences, another sequence can be compared to the set to
identify to which subset the sequence belongs.
Therefore, the invention provides a method for
identifying a subset of polypeptides. The method
includes the steps of: (a) determining a query sequence
comparison signature for an amino acid sequence, wherein
the query sequence comparison signature includes pairwise
comparison scores for the amino acid sequence compared to
each amino acid sequence in a set; (b) comparing the
distance between the query sequence comparison signature
and the sequence comparison signatures for other amino
acid sequences in the set, wherein the sequence
comparison signatures for other amino acid sequences in
the set are clustered into two or more subsets; and (c)
identifying a proximal cluster having one or more
sequence comparison signatures that have a closer distance
to the query sequence comparison signature than the
sequence comparison signatures of a distal cluster.

A similar method can be used to identify a
member of a pharmacofamily. The method includes the
steps of: (a) determining a query sequence comparison
signature for an amino acid sequence, wherein the query
sequence comparison signature includes pairwise
comparison scores for the amino acid sequence compared to
each amino acid sequence in a set; (b) comparing the
distance between the query sequence comparison signature
and the sequence comparison signatures for other amino
acid sequences in the set, wherein the sequence
comparison signatures for other amino acid sequences in
the set are clustered into pharmacofamilies; and (c)
identifying a proximal cluster having one or more
sequence comparison signatures that have a closer distance
to the query sequence comparison signature than the
sequence comparison signatures of a distal cluster,
thereby identifying the sequences having the query
sequence comparison signature as being a member of the
pharmacofamily for the proximal cluster, wherein the
pharmacofamilies for the proximal and distal clusters
belong to the same ligand binding family.
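
Steps (a) through (c) of the query methods above can be sketched as
follows. This is an illustrative outline only: the scoring function
`toy_score`, the Euclidean distance metric, the example sequences and
the cluster labels are all assumptions chosen for brevity, and any
pairwise comparison score and distance measure could be substituted.

```python
import math

def toy_score(a, b):
    """Hypothetical stand-in score: fraction of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def signature(query, sequences, score):
    """Step (a): pairwise comparison scores of `query` against each set member."""
    return [score(query, s) for s in sequences]

def distance(sig_a, sig_b):
    """Euclidean distance between two signatures (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))

def proximal_cluster(query_sig, clustered_sigs):
    """Steps (b) and (c): label of the cluster holding the closest signature."""
    best_label, best_dist = None, float("inf")
    for label, sigs in clustered_sigs.items():
        for sig in sigs:
            d = distance(query_sig, sig)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

# Hypothetical set of four sequences already clustered into two groups.
set_sequences = ["MKVLAA", "MKVLAG", "WPRDEE", "WPRDEA"]
clusters = {
    "family_1": [signature(s, set_sequences, toy_score) for s in set_sequences[:2]],
    "family_2": [signature(s, set_sequences, toy_score) for s in set_sequences[2:]],
}
query_sig = signature("MKVIAA", set_sequences, toy_score)
assigned = proximal_cluster(query_sig, clusters)
```

In this toy case the query signature lies closest to a signature in
`family_1`, which is therefore the proximal cluster and `family_2`
the distal cluster.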

Further provided by the invention is a method
for constructing a conformer model. The method includes
the steps of: (a) determining a sequence comparison
signature for each amino acid sequence in a set of amino
acid sequences, wherein the sequence comparison signature
includes pairwise comparison scores for the amino acid
sequence compared to each of the other amino acid
sequences in the set; (b) constructing a distance
arrangement including the sequence comparison signatures
related according to the distance between each of the
sequence comparison signatures; (c) identifying separate
clusters of sequence comparison signatures in the
distance arrangement, wherein the separate clusters
include sequence comparison signatures for amino acid
sequences in the same ligand binding family and separate
pharmacofamilies; (d) determining bound conformations of
the ligand bound to the members of a pharmacofamily; and
(e) constructing an average structure of the bound
conformations, wherein the average structure is a
conformer model of the ligand.
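
Steps (a) through (c) above can be illustrated with a short sketch.
Each row of a pairwise score matrix is treated as a sequence
comparison signature, a distance matrix serves as the distance
arrangement, and clusters are found by grouping signatures that lie
within a cutoff of one another. The single-linkage grouping, the
cutoff and the score values are assumptions made for illustration
and are not prescribed by the method.

```python
import math
from itertools import combinations

def distance_matrix(signatures):
    """Step (b): Euclidean distance between every pair of signatures."""
    n = len(signatures)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(signatures[i], signatures[j])
    return dist

def threshold_clusters(dist, cutoff):
    """Step (c): single-linkage grouping of signatures closer than `cutoff`."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] < cutoff:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Step (a): hypothetical pairwise comparison scores for four sequences;
# row i is the sequence comparison signature of sequence i.
scores = [
    [1.00, 0.83, 0.00, 0.17],
    [0.83, 1.00, 0.17, 0.00],
    [0.00, 0.17, 1.00, 0.83],
    [0.17, 0.00, 0.83, 1.00],
]
clusters = threshold_clusters(distance_matrix(scores), cutoff=1.0)
```

Here sequences 0 and 1 fall into one cluster and sequences 2 and 3
into another, which would correspond to two candidate
pharmacofamilies in the terms used above.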

An average structure of the bound conformations
of a ligand in a cluster can be determined by a variety
of methods known in the art. For example, an average
structure can be determined by overlaying bound
conformations, or portions thereof, and identifying an
average location for each atom. Bound conformations in a
group to be averaged can be overlaid relative to a
single member or relative to a centroid position for each
atom. Algorithms for determining an average structure
are known in the art and include, for example, the OVERLAY
routine in INSIGHT98 (Molecular Simulations Inc., San
Diego, CA).
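
The averaging step described above can be sketched as a per-atom mean
position. This sketch assumes the bound conformations have already
been overlaid and list the same atoms in the same order; it is not
the OVERLAY routine, and the coordinates are hypothetical.

```python
def average_structure(conformations):
    """Per-atom mean position over conformations that are already overlaid."""
    n_atoms = len(conformations[0])
    if any(len(c) != n_atoms for c in conformations):
        raise ValueError("all conformations must list the same atoms in the same order")
    averaged = []
    for atom_positions in zip(*conformations):
        count = len(atom_positions)
        averaged.append(
            tuple(sum(p[k] for p in atom_positions) / count for k in range(3))
        )
    return averaged

# Three hypothetical two-atom bound conformations (angstroms).
bound = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.2, 0.0), (1.0, 0.4, 0.0)],
    [(0.0, 0.4, 0.0), (1.0, 0.2, 0.0)],
]
model = average_structure(bound)
```

The resulting `model` need not coincide with any one of the input
conformations.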

The format of a ligand conformer model can be
chosen based on the method used to generate the model and
the desired use of the model. In this regard, a
conformer model can be represented as a single structure.
The resulting structure can be a unique structure
compared to the conformations of the ligand bound to
polypeptides in a cluster from which it was derived.
Thus, the conformer model can be a new structure never
before observed in nature. A model represented by a
single structure can be useful for making visual
comparisons by overlaying other structures with the
model. A conformer model can also be represented as a
plurality of structures incorporating all or a subset of
the bound conformations of the ligand bound to
polypeptides in a cluster. A model represented by
multiple structures can be useful for identifying a range
of minor deviations in the model.

In yet another representation, the conformer
model can be a volume surrounding all or a subset of the
bound conformations of a ligand bound to polypeptides in
a cluster. A model showing volume can be useful for
comparing other structures in a fitting format such that
a structure which fits within the volume of the model can
be identified as substantially similar to the model. One
approach that can be used to fit a structure to a volume
is comparison of equivalent surface patches using
gnomonic projection as described, for example, in Chau and
Dean, J. Mol. Graphics 7:130 (1989). Use of a gnomonic
projection to compare structures is also described in
Doucet and Weber, Computer-Aided Molecular Design: Theory
and Applications, Academic Press, San Diego CA (1996).
Algorithms which can be used to fit a structure to a
volume are known in the art and include, for example,
CATALYST (Molecular Simulations Inc., San Diego, CA) and
THREEDOM, which is a part of the INTERCHEM package and
makes use of an Icosahedral Matching Algorithm (Bladon,
J. Mol. Graphics 7:130 (1989)) for the comparison and
alignment of structures. Methods of identifying a
binding compound by searching a database of structures
using a gnomonic projection are described, for example,
in U.S. patent application number 09/753,020, which is
hereby incorporated by reference.
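
The notion of a structure fitting within the volume of a model can be
illustrated with a deliberately simplified sketch. Here the volume is
approximated as a union of spheres centered on the model atoms, and a
structure fits if every one of its atoms lies inside at least one
sphere. This is not the gnomonic projection approach nor the CATALYST
or THREEDOM algorithms; the sphere radius, function name and
coordinates are assumptions.

```python
import math

def fits_volume(structure, model_atoms, radius=1.5):
    """True if every atom of `structure` lies within `radius` of some model atom."""
    return all(
        any(math.dist(atom, center) <= radius for center in model_atoms)
        for atom in structure
    )

# Hypothetical model atoms and two candidate structures (angstroms).
model_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
inside = [(0.2, 0.3, 0.0), (2.8, -0.4, 0.1)]
outside = [(0.2, 0.3, 0.0), (3.0, 4.0, 0.0)]

fits_inside = fits_volume(inside, model_atoms)
fits_outside = fits_volume(outside, model_atoms)
```

Under these assumptions the first candidate fits the volume and the
second does not, because one of its atoms lies 4 Å from the nearest
model atom.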

A conformer model can be useful in querying a
database of polypeptide structures to find other members
of a polypeptide pharmacofamily. For example, a member
of a polypeptide pharmacofamily can be identified by
querying a database of bound conformations of a ligand to
identify a retrieved bound conformation of a ligand that
is substantially similar to the query structure, thereby
identifying a polypeptide bound to the retrieved bound
conformation as a member of the same pharmacofamily as a
polypeptide bound to the query bound conformation. A
conformer model can also be used to identify a new member
of a polypeptide pharmacofamily by querying a database of
one or more polypeptide structures using an algorithm
that docks the conformer model, wherein a favorable
docking result with a retrieved polypeptide indicates
that the retrieved polypeptide is a member of the same
polypeptide pharmacofamily as a polypeptide bound to the
bound conformation used as a query. In the latter mode,
a potential new member of a pharmacofamily from which the
conformer model was derived can be identified. The
database queries described above can be performed with
algorithms available in the art including, for example,
THREEDOM and CATALYST. Membership can be confirmed by
using sequence based clustering methods described above
for a sequence comparison signature of the amino acid
sequence of the new member compared to amino acid
sequences of other members of the group.

An advantage of the invention is that a
conformer model can be used to identify a binding
compound that is specific for polypeptides of a
pharmacofamily. For example, the conformer model can be
compared to a structure of a compound or to a bound
conformation of a ligand to identify those having similar
conformation. A conformer model can be further used to
query a database of compounds to identify individual
compounds having similar conformations.

A conformer model of the invention can also be
used to design a binding compound that is specific for
polypeptides of one or more pharmacofamilies. The
methods of the invention provide a conformer model that
can be produced according to a cluster of bound
conformations of a ligand that are specific for
polypeptides of a pharmacofamily. A conformer model
identified by these criteria can be used as a scaffold
structure for developing a compound having enhanced
binding affinity or specificity for polypeptides of a
pharmacofamily. Such a scaffold can also be used to
design a combinatorial synthesis producing a library of
compounds which can be screened for enhanced binding
affinity for polypeptide members of a pharmacofamily or
specificity for polypeptide members of one pharmacofamily
compared to polypeptide members of another
pharmacofamily. An algorithm can be used to design a
binding compound based on a conformer model including,
for example, LUDI as described by Bohm, J. Comput. Aided
Mol. Des. 6:61-78 (1992).

A conformer model can include a portion of
atoms in the bound conformations of a ligand bound to
members of a pharmacofamily so long as the portion
consists of contiguous atoms of a bound conformation of a
ligand and provides sufficient information to distinguish
one pharmacofamily from another. Thus, a conformer model
can be constructed by overlaying corresponding fragments
of bound conformations of a ligand and obtaining an
average structure according to the methods described
above. A conformer model made from a portion of a ligand
can be advantageous due to its small size compared to a
complete structure of the ligand from which it was
derived. A conformer model based on a portion of a bound
conformation of a ligand can also be used to more
efficiently and rapidly query a database due to a reduced
use of computer memory compared to the memory required to
manipulate and store a structure containing all atoms of
the ligand.

The invention provides a method for
constructing a pharmacophore model. The method includes
the steps of: (a) determining a sequence comparison
signature for each amino acid sequence in a set of amino
acid sequences, wherein the sequence comparison signature
includes pairwise comparison scores for the amino acid
sequence compared to each of the other amino acid
sequences in the set; (b) constructing a distance
arrangement including the sequence comparison signatures
related according to the distance between each of the
sequence comparison signatures; (c) identifying separate
clusters of sequence comparison signatures in the
distance arrangement, wherein the separate clusters
include sequence comparison signatures for amino acid
sequences in the same ligand binding family and separate
pharmacofamilies; (d) comparing the bound conformations
of the ligand bound to members of one of the
pharmacofamilies; (e) identifying one or more
conformation-dependent properties of the ligand bound to
members of one of the pharmacofamilies; and (f)
constructing a pharmacophore model that contains the one
or more conformation-dependent properties.

A pharmacophore model can be any representation
of points in a defined coordinate system that correspond
to positions of atoms in a bound conformation of a
ligand. For example, a point in a pharmacophore model
can correlate with the center of an atom in a conformer
model. An atom of a conformer model can also be
represented by a series of points forming a line, plane
or sphere. A line, plane or sphere can form a geometric
representation designating, for example, shape of one or
more atoms or volume occupied by one or more atoms.

A pharmacophore model can be represented in any
coordinate system including, for example, a 2 dimensional
Cartesian coordinate system or 3 dimensional Cartesian
coordinate system. Other coordinate systems that can be
used include a fractional coordinate system or reciprocal
space such as those used in crystallographic calculations
which are described in Stout and Jensen, supra.

In addition to a geometric description of a
bound conformation of a ligand, a pharmacophore model can
include other characteristics of atoms or moieties of the
ligand including, for example, charge or hydrophobicity.
Thus, a pharmacophore model can be a generalized
structure, which includes but does not unambiguously
describe the bound conformations of the ligand bound to
the polypeptides in the pharmacofamily from which it was
derived. For example, atoms can be represented as units
of charge such that an oxygen in a bound conformation of
a ligand can be represented by an electronegative point
in the pharmacophore model. In this example, the
electronegative point in the pharmacophore model includes
any electronegative atom at that particular location
including, for example, an oxygen or sulfur.
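
The generalized point described above can be sketched as a small data
structure. The class name, the matching tolerance and the feature
table are hypothetical; the table lists only the oxygen and sulfur
examples given in the text and could be extended to other atoms.

```python
import math
from dataclasses import dataclass

# Hypothetical feature table: element symbols grouped by generalized property.
FEATURE_ELEMENTS = {"electronegative": {"O", "S"}}

@dataclass(frozen=True)
class FeaturePoint:
    position: tuple   # (x, y, z) in the model's coordinate system
    feature: str      # generalized property, e.g. "electronegative"
    tolerance: float  # assumed matching radius, same units as position

    def matches(self, element, position):
        """True if an atom of `element` at `position` satisfies this point."""
        close = math.dist(self.position, position) <= self.tolerance
        return close and element in FEATURE_ELEMENTS.get(self.feature, set())

# An oxygen observed in a bound conformation, stored as an electronegative point.
point = FeaturePoint(position=(1.0, 2.0, 3.0), feature="electronegative", tolerance=0.5)
```

With this representation a sulfur near the stored location satisfies
the point just as an oxygen does, whereas a carbon at the same
location does not.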

A pharmacophore model can be constructed to
include, in addition to characteristics of the ligand
itself, characteristics of an atom or moiety that
interacts with the ligand and is from a bound polypeptide.
Characteristics of an interacting polypeptide atom or
moiety that can be included in a pharmacophore model
include, for example, atomic number, volume occupied,
distance from an atom of the ligand, charge,
hydrophobicity, polarity, or location relative to the
ligand. Methods for constructing a pharmacophore model
to include interacting atoms from a polypeptide are
provided in U.S. patent application number 09/753,020,
which is hereby incorporated by reference.



A characteristic included in a pharmacophore
model can be incorporated into a geometric representation
using any additional representation that can be
correlated with the characteristic. For example,
color or shading can be used to identify regions having
characteristics such as charge, polarity, or
hydrophobicity. As such, the depth of shading or color
or the hue of color can be used to determine the degree
of a characteristic. By way of example, a common
convention used in the art is to identify regions of
increased positive charge with deeper shades of blue,
areas of increased negative charge with deeper shades of
red and neutral regions with white. Numeric
representations can also be used in a pharmacophore model
including, for example, values corresponding to potential
energy for an interaction, or degree of polarity.
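
The color convention described above can be sketched as a mapping
from a charge value to an RGB triple. The linear scaling, the
clamping limit `max_abs_charge` and the function name are assumptions
made for illustration.

```python
def charge_to_rgb(charge, max_abs_charge=1.0):
    """Map charge to RGB: deeper blue if positive, deeper red if negative, white if neutral."""
    # Clamp the scaled charge to the range [-1, 1].
    level = max(-1.0, min(1.0, charge / max_abs_charge))
    # Less of the other two channels gives a deeper shade.
    fade = round(255 * (1.0 - abs(level)))
    if level > 0:
        return (fade, fade, 255)
    if level < 0:
        return (255, fade, fade)
    return (255, 255, 255)

neutral = charge_to_rgb(0.0)
positive = charge_to_rgb(1.0)
negative = charge_to_rgb(-1.0)
```

Intermediate charges give intermediate shades, so the depth of color
indicates the degree of the characteristic.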

In addition, a pharmacophore model can 
incorporate constraints of a physical or chemical 
property of the bound conformations of a ligand bound to 
members of a pharmacofamily. A constraint of a physical
property can be, for example, a distance between two 
atoms, allowed torsion angle of a bond, or volume of 
space occupied by an atom or moiety. A constraint of a 
chemical property can be, for example, polarity, van der 
Waals interaction, hydrogen bond, ionic bond, or 
hydrophobic interaction. Such constraints can be 
included in a pharmacophore model using the 
representations described above. 
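
As an example of a physical constraint of the kind described above,
a torsion angle can be computed from the coordinates of four atoms
and tested against an allowed range. The sketch below uses the
standard signed dihedral formula; the helper names, the coordinates
and the allowed range are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p1, p2, p3, p4):
    """Signed torsion angle in degrees about the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

def within_allowed_range(angle, low, high):
    """True if the torsion angle satisfies an allowed-range constraint."""
    return low <= angle <= high

# Four hypothetical atom positions giving a 90 degree torsion.
angle = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
allowed = within_allowed_range(angle, 60.0, 120.0)
```

A distance constraint between two atoms could be checked in the same
way by comparing `math.dist` of their coordinates with allowed bounds.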

A pharmacophore model can include bound
conformations of a ligand bound to members of 2 or more
pharmacofamilies. Such a pharmacophore model can be used
to identify a ligand having broad specificity for two or
more polypeptide pharmacofamilies. Additionally, in
order to identify a ligand that can preferentially bind a
first polypeptide which belongs to a first polypeptide
pharmacofamily compared to a second polypeptide of a
second polypeptide pharmacofamily, a pharmacophore model
can incorporate constraints on geometry or any other
characteristic so as to exclude a characteristic of the
bound conformation of the ligand bound to the second
polypeptide. For example, a geometric constraint can be
a forbidden region for one or more atoms of a bound
conformation of a ligand. A forbidden region can be
identified by overlaying two conformer models in a
coordinate system and identifying a coordinate or set of
coordinates differentially occupied by one or more atoms
of the conformer models. A pharmacophore model
incorporating a forbidden region as such will be specific
for a polypeptide of one pharmacofamily over a
polypeptide of a second pharmacofamily correspondent with
the constraint incorporated.
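
Identification of a forbidden region from two overlaid conformer
models can be sketched with a coarse grid. In this simplified
illustration each atom center is assigned to a cubic cell, and the
forbidden region is the set of cells occupied by the second model but
not by the first. The cell size, function names and coordinates are
assumptions.

```python
def occupied_cells(atoms, cell=1.0):
    """Grid cells (integer index triples) containing at least one atom center."""
    return {tuple(int(c // cell) for c in atom) for atom in atoms}

def forbidden_region(preferred_model, excluded_model, cell=1.0):
    """Cells occupied by the excluded model but not by the preferred model."""
    return occupied_cells(excluded_model, cell) - occupied_cells(preferred_model, cell)

def violates(compound_atoms, forbidden, cell=1.0):
    """True if any atom of the compound falls in a forbidden cell."""
    return bool(occupied_cells(compound_atoms, cell) & forbidden)

# Two hypothetical conformer models already overlaid in one coordinate system.
model_a = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (2.5, 0.5, 0.5)]
model_b = [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5), (1.5, 1.5, 0.5)]
forbidden = forbidden_region(model_a, model_b)
```

A candidate compound that places an atom in a forbidden cell would be
excluded by a pharmacophore model intended to be specific for the
pharmacofamily represented by `model_a`.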

An advantage of the invention is that a 
pharmacophore model can be created based on multiple 
structures of the same ligand. In comparison to a 
pharmacophore model derived from a single structure or 
different ligands, a pharmacophore model derived from 
multiple bound conformations of the same ligand can 
include a greater degree of geometric information. For 
example, averaging of multiple bound conformations of the 
same ligand can provide torsion angle constraints that 
are not available from a single structure and not evident 
from comparing different ligands. 

A conformation-dependent property can be
identified as any property that correlates with a bound
conformation of a ligand such that a change in the bound
conformation results in a change in the
conformation-dependent property. Accordingly, a bound
conformation of a ligand, or a portion thereof, can be a
conformation-dependent property. A portion of a bound
conformation of a ligand can be a contiguous fragment or
a non-contiguous set of atoms or bonds. A bound
conformation of a ligand, or portion thereof, can be
identified by any method for determining the three
dimensional structure of a ligand including as disclosed
herein.

Other conformation-dependent properties 
include, for example, absorption and emission of heat, 
absorption and emission of electromagnetic radiation, 
rotation of polarized light, magnetic moment, spin state 
of electrons, or polarity, as disclosed herein, or other 
properties that can be identified as a spectroscopic 
signal. Methods known in the art for measuring changes 
in absorption and emission of heat that correlate with 
changes in bound conformation of a ligand include, for 
example, calorimetry. Methods known in the art for 
measuring changes in absorption and emission of 
electromagnetic radiation as they correlate with changes 
in bound conformation of a ligand include, for example, 
UV/VIS spectroscopy, fluorimetry, luminometry, infrared 
spectroscopy, Raman spectroscopy, resonance Raman 
spectroscopy, X-ray absorption fine structure 
spectroscopy (XAFS) and the like. A change in a bound 
conformation of a ligand that is correlated with a change 
in rotation of polarized light can be measured with 
circular dichroism spectroscopy or optical rotation 
spectroscopy. A change in magnetic moment or spin state
of an electron that correlates with a change in a bound
conformation can be measured, for example, with electron
paramagnetic resonance spectroscopy (EPR) or nuclear
magnetic resonance spectroscopy (NMR).

When based on NMR data, a conformation-dependent
property can be identified as an NMR signal
including, for example, chemical shift, J coupling,
dipolar coupling, cross-correlation, nuclear spin
relaxation, transferred nuclear Overhauser effect, and
any combination thereof. A conformation-dependent
property can be identified by NMR methods in both fast
and slow exchange regimes. For example, in many cases,
the exchange rate of a complex between ligand and
polypeptide is faster than the ligand spin relaxation
rate (1/T1H). In this situation, referred to as the "fast
exchange regime," transferred nuclear Overhauser effect
(NOE) experiments can be performed to measure an
intra-ligand proton-proton distance (Wuthrich, NMR of
Proteins and Nucleic Acids, Wiley, New York (1986) and
Gronenborn, J. Magn. Reson. 53:423-442 (1983)). Labeling of
polypeptides is not required, and the ligand to polypeptide
concentration ratio can be adjusted to minimize line
broadening of the ligand resonances while retaining
strong NOE contribution from the bound form.

In a fast exchange regime, cross-correlated
relaxation measurements can also provide structural
information on ligand torsion angles (Carlomagno et al.,
J. Am. Chem. Soc. 121:1945-1948 (1999)). These
measurements include the 1H-1H dipole-dipole
cross-correlation but can be extended to other
cross-correlated relaxation mechanisms involving also homo-
and heteronuclear chemical shielding anisotropy relaxation,
as well as quadrupolar relaxation. For most of these
heteronuclear experiments, the natural abundance of the
isotope can be exploited. In cases where natural
abundance of the isotope measured is not sufficient,
isotope enriched ligands can be obtained from commercial
sources such as Isotek (Miamisburg, OH) or Cambridge
Isotope Laboratories (Andover, MA) or prepared by methods
known in the art. Another method to determine a
conformation-dependent property of a ligand in a fast
exchange regime is use of residual homo- and
heteronuclear dipolar couplings in partially aligned
samples (Tolman et al., Proc. Natl. Acad. Sci. USA
92:9279-9283 (1995)).

In the slow exchange regime, the NMR signals 
arising from the bound conformation of the ligand are 
distinguished from those of the polypeptide to reduce 
resonance overlap. This can be achieved with different 
isotope labeling schemes of polypeptide, ligand or both. 
For large systems, perdeuteration of macromolecules and 
TROSY-type experiments (Pervushkin, Proc. Natl. Acad. 
Sci. USA 94:12366-12371 (1997)) can be used to minimize 
signal losses due to fast transverse relaxation of the 
resonances of the complex. With the appropriate sample 
requirements and isotope filtered experiments,
cross-correlations, cross-relaxations and residual dipolar
couplings can be measured and provide necessary
structural information.

In addition, homo- and heteronuclear two and
three bond J couplings can be obtained to provide
information on torsion angles (Wuthrich, supra). For
example, the bound conformations of NADP bound to members
of different pharmacofamilies can differ by a torsion
angle defined by the atoms PN-O5'N-C5'N-C4'N as described
in U.S. Patent Application Number 09/753,020, which is
hereby incorporated by reference. These torsion angles
can be measured and distinguished by measuring the three
bond 31P-13C4' J coupling constants that correspond to this
torsion angle (Marino, Acc. Chem. Res. 32:614-623
(1999)). Basically, two 1H-13C correlation experiments
can be performed with and without 31P decoupling during 13C
evolution. The intensity ratio of the 1H4'/13C4' cross
peak from each experiment is proportional to the 31P-13C4'
J coupling constant.

Correlation of a conformation-dependent
property with a bound conformation of a ligand can be 
achieved by any method that has sufficient sensitivity to 
detect changes that correlate with changes in bound 
conformation of a ligand. Such a correlation can be 
determined by measuring a conformation-dependent property 
for various conformations of a ligand and determining the 
extent of change in the signal with change in the 
conformation. Signal changes that correlate with changes 
in conformation and that are detectable with a signal to 
noise ratio accepted in the art as significant can be 
used in the invention. 
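
One simple way to quantify the extent to which a measured signal
changes with conformation is a correlation coefficient, combined with
a signal to noise check. The sketch below uses a Pearson coefficient,
which captures only a linear relationship; the data values, the
function names and the `min_snr` threshold are assumptions, and the
threshold accepted as significant will depend on the technique used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def detectable(signal_change, noise, min_snr=3.0):
    """True if a signal change exceeds an assumed signal to noise threshold."""
    return abs(signal_change) / noise >= min_snr

# Hypothetical torsion angles (degrees) and a measured signal for five complexes.
torsions = [60.0, 90.0, 120.0, 150.0, 180.0]
signals = [2.1, 3.0, 4.2, 4.9, 6.1]
r = pearson(torsions, signals)
usable = detectable(signals[-1] - signals[0], noise=1.0)
```

A coefficient near 1 or -1 together with a detectable signal change
would suggest that the property tracks the conformational parameter
in this toy data set.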

Correlation between a conformation-dependent 
property and a conformation can be determined for a 
ligand bound to any partner so long as binding is 
specific and stable. For example, for purposes of
establishing a correlation, changes in a
conformation-dependent property that correlate with changes
in bound conformation of a ligand can be determined for a
ligand bound to polypeptides from different polypeptide
pharmacofamilies. A bound conformation of the ligand in
each complex can be determined and a
conformation-dependent property can be measured for each
complex. Comparison of bound conformations of the ligand in
each complex with a measured conformation-dependent property
can be used to establish a correlation. Demonstration of
a method for establishing a correlation between an NMR
signal and bound conformations of a ligand is described
in U.S. patent application number 09/753,020, which is
hereby incorporated by reference. Other methods for
correlating spectroscopic signals with bound
conformations of a ligand are known in the art including,
for example, correlation of transferred NOE signals with
anti and syn conformations of the nicotinamide ring in
NADPH as described in Sem and Kasper, Biochemistry
31:3391-3398 (1992). Correlation of transferred NOE
signals with conformation is also described in Clore and
Gronenborn, J. Magn. Reson. 48:402-417 (1982).

A correlation between a bound conformation and 
a conformation-dependent property can also be established
for a ligand bound to a non-polypeptide binding partner
because a conformation-dependent property of a ligand can
be independent of interactions that differ between
binding partners so long as the ligand is in the same
bound conformation when bound to the binding partners.
Other binding partners include, for example, nucleic
acids, carbohydrates, and synthetic organometallic
complexes.

The invention further provides a method for
predicting the bound conformation of a ligand bound to
a polypeptide. The method includes the steps of: (a)
determining a query sequence comparison signature for an
amino acid sequence, wherein the query sequence
comparison signature includes pairwise comparison scores
for the amino acid sequence compared to each amino acid
sequence in a set; (b) comparing the distance between the
query sequence comparison signature and the sequence
comparison signatures for other amino acid sequences in
the set, wherein the sequence comparison signatures for
other amino acid sequences in the set are clustered into
pharmacofamilies; (c) identifying a proximal cluster
having one or more sequence comparison signature that has
a closer distance to the query sequence comparison
signature than the sequence comparison signatures of a
distal cluster, thereby identifying the sequences having
the query sequence comparison signature as being a member
of the pharmacofamily for the proximal cluster, wherein
the pharmacofamilies for the proximal and distal clusters
belong to the same ligand binding family; and (d)
obtaining a pharmacophore model of the ligand bound to
the pharmacofamily for the proximal cluster, wherein the
pharmacophore model includes a prediction of the bound
conformation for the ligand bound to the amino acid
sequence having the query sequence comparison signature.
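
Steps (a) through (c) can be illustrated with a minimal sketch. It assumes Euclidean distance between signatures (the metric used in Example I) and assumes that the proximal cluster is the one holding the single closest signature. The function names, cluster labels and toy signatures are hypothetical and do not come from the disclosure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length score signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximal_cluster(query, clusters):
    """Return (label, distance) for the cluster holding the closest signature.

    clusters maps a hypothetical cluster label to a list of signatures.
    """
    best_label, best_dist = None, float("inf")
    for label, signatures in clusters.items():
        for sig in signatures:
            d = euclidean(query, sig)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist

# Toy data: three-score signatures invented for illustration only.
clusters = {
    "pharmacofamily_A": [(9, 5, 0), (5, 9, 1)],
    "pharmacofamily_B": [(0, 1, 9)],
}
label, dist = proximal_cluster((8, 6, 0), clusters)
print(label, round(dist, 3))
```

Other linkage rules, such as distance to a cluster centroid, would be equally consistent with the text; nearest-member assignment is used here only because it is the simplest.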

A pharmacophore model can be useful in querying 
a database of polypeptide structures to find other 
members of a polypeptide pharmacofamily. For example, a
member of a polypeptide pharmacofamily can be identified
by querying a database of bound conformations of a ligand
to retrieve a structure that fits the constraints of the
query pharmacophore model, thereby identifying the
retrieved polypeptide as a member of the pharmacofamily
from which the pharmacophore model was derived. A
pharmacophore model can also be used to identify a new
member of a polypeptide pharmacofamily by querying a
database of one or more polypeptide structures using an
algorithm that docks or compares the pharmacophore model
to polypeptide structures, wherein a favorable docking or
comparison identifies a polypeptide as a member of the
same polypeptide pharmacofamily from which the
pharmacophore model was derived. The database queries
described above can be performed with algorithms
available in the art including, for example, THREEDOM and
CATALYST. Membership can be confirmed by using the
sequence-based clustering methods described above for a
sequence comparison signature of the amino acid sequence
of the new member compared to the amino acid sequences of
other members of the group.

An advantage of the invention is that a
pharmacophore model can also be used to identify a
binding compound that is specific for polypeptides of one
or more pharmacofamilies. For example, a pharmacophore
model can be compared to a structure of a compound or to
a bound conformation of a ligand to identify those having
similar properties. A conformer model can be further
used to query a database of compounds to identify
individual compounds having similar properties.

A pharmacophore model of the invention can also
be used to design a binding compound that is specific for
polypeptides of one or more pharmacofamilies. A
pharmacophore model identified by these criteria can be
used as a scaffold or set of constraints for developing a
compound having enhanced binding affinity or specificity
for polypeptides of one or more pharmacofamilies. Using
similar methods a pharmacophore model can be used to
design a combinatorial synthesis producing a library of
compounds having properties consistent with or similar to
the model, which can then be screened for enhanced binding
affinity or specificity for polypeptide members of one or
more pharmacofamilies. An algorithm can be used to
design a binding compound based on a pharmacophore model
including, for example, LUDI as described by Bohm, J.
Comput. Aided Mol. Des. 6:61-78 (1992).

A compound can be identified as satisfying the 
constraints of a pharmacophore model by a variety of 
methods for comparing structures. For example, a 
pharmacophore model that is a geometric representation 
such as a conformer model can be overlaid with a 
compound, and the best fit determined as described 
herein. Substantial overlap between a compound and a
pharmacophore model can be indicated by a visual
comparison and/or a computational comparison based on,
for example, RMSD values or torsion angle values as
described above. In a case where a pharmacophore model
is represented by constraints, a compound can be fitted
to the pharmacophore model to identify whether the
properties of the compound satisfy the constraints of the
pharmacophore model. For example, if a pharmacophore
model contains, as a constraint, a maximum distance
between atoms, a compound that satisfies the constraint
can be identified as having a distance between
corresponding atoms that is no greater than the maximum
value. One skilled in the art will know how to extend such
methods of comparison to any physical or chemical
constraint.
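
As a hedged illustration of the distance example, the sketch below reads the constraint as an upper bound on the separation of two corresponding atoms. The atom labels, coordinates and function name are hypothetical; a real pharmacophore model would carry many such constraints.

```python
import math

def satisfies_max_distance(coords, atom_i, atom_j, max_distance):
    """True if the two named atoms are no farther apart than max_distance."""
    return math.dist(coords[atom_i], coords[atom_j]) <= max_distance

# Hypothetical compound: atom labels mapped to (x, y, z) in angstroms.
compound = {"N1": (0.0, 0.0, 0.0), "O7": (3.0, 4.0, 0.0)}

# The two atoms are 5.0 angstroms apart.
print(satisfies_max_distance(compound, "N1", "O7", 6.0))
print(satisfies_max_distance(compound, "N1", "O7", 4.5))
```

The same pattern extends to other scalar constraints (for example a torsion angle range) by swapping the measured quantity and the comparison.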

A compound can also be identified as satisfying
the constraints of a pharmacophore model by demonstrating
the same characteristics for one or more specific atoms
located within a volume of space defined by the geometric
constraints of the pharmacophore model. For example, in
a case where polarity is a constraint and where a
conformation of a compound can be overlaid with a
pharmacophore model, an atom that overlaps a volume of
space indicated by the pharmacophore and having polarity
within the defined limits can be identified as satisfying
constraints of the pharmacophore. By extension, a
compound having atoms which satisfy all constraints of a
pharmacophore is identified as a binding compound for one
or more members of a polypeptide pharmacofamily from
which the pharmacophore was produced.

Furthermore, the invention provides a method
for predicting the three-dimensional structure of a
polypeptide. A subset of polypeptides to which a query
sequence belongs can be identified as described above. A
polypeptide having a sequence comparison signature in the
same cluster as the sequence comparison signature for the
query polypeptide and for which a three-dimensional
structural model has been determined can be identified
and the three dimensional structural model used as a
template to construct a three dimensional model of the
query polypeptide. For example, such a method can
include the steps of: (a) determining a query sequence
comparison signature for an amino acid sequence, wherein
the query sequence comparison signature includes pairwise
comparison scores for the amino acid sequence for a query
polypeptide compared to each amino acid sequence in a
set; (b) comparing the distance between the query
sequence comparison signature and the sequence comparison
signatures for other amino acid sequences in the set,
wherein the sequence comparison signatures for other
amino acid sequences in the set are clustered into two or
more subsets; (c) identifying a proximal cluster having
one or more sequence comparison signature that has a
closer distance to the query sequence comparison
signature than the sequence comparison signatures of a
distal cluster, thereby identifying the sequences having
the query sequence comparison signature as being a member
of the subset for the proximal cluster; (d) identifying a
polypeptide having a sequence comparison score in the
proximal cluster and a three-dimensional structure model;
and (e) producing a structural model of the query
polypeptide using the three-dimensional structure model
as a template.

A variety of methods are known in the art for 
modeling the three dimensional structure of a polypeptide 
according to the amino acid sequence of the polypeptide
and a structure of a second polypeptide used as a
template. Available algorithms include, for example,
GRASP (Nicholls, A., supra), ALADDIN (Van Drie et al.,
supra), INSIGHT98 (Molecular Simulations Inc., San Diego
CA), RASMOL (Sayle et al., Trends Biochem. Sci. 20:374-376
(1995)) and MOLMOL (Koradi et al., J. Mol. Graphics
14:51-55 (1996)). Construction of a homology model for
a polypeptide based on a template identified by the
sequence-based clustering methods of the invention is
demonstrated in Example III.

A model of a polypeptide determined by the
methods of the invention can be useful for identifying a
function of the polypeptide. For example, residues of a
polypeptide that are involved in binding can be
identified using a model of the invention. Residues
identified as participating in binding can be modified,
for example, to engineer new functions into a
polypeptide, to reduce an intrinsic activity of a
polypeptide, or to enhance an intrinsic activity of a
polypeptide. In another example, a model of a
polypeptide can be compared to other polypeptide
structures to identify similar functions. Exemplary
functions that can be identified from a polypeptide
structure include binding interactions with other
polypeptides and catalytic activities.

The following examples are intended to
illustrate but not limit the present invention.

EXAMPLE I 

Sequence-Based Clustering of Polypeptides 

This example describes methods for grouping
polypeptides into classes of overall fold and similar
characteristics in their binding sites based on
relationships identified by comparing their amino acid
sequences.

Each polypeptide in a set of 15 amino acid
sequences was characterized by a string of scores that
described its sequence similarity to every other sequence
in the data set. The string of scores constitutes a
descriptor or property of the polypeptide. Figure 1
shows comparison scores for 15 sequences. The scores of
Figure 1 are percent identity scores that have been
binned into 10 different groups and were computed using
BLAST 2.1.2 from NCBI as described in Nicholas et al.,
Biotechniques 28:1174-1191 (2000). The values were
binned into 10 groups, from 0 to 9, where sequences
having a pairwise identity of less than 10% are binned
into 0, those with 11-20% identity are binned into 1 and
so forth up to bin 9 which contains those with identity
scores of 91-100%. Accordingly, 9 indicates that the
sequences are highly similar or identical, while 0
indicates there is no similarity between the two
sequences. The sequence comparison signature for
Sequence 1 was (9,0,0,0,0,0,0,0,0,5,2,1,0,0,0), and the
sequence comparison signature for Sequence 2 was
(0,9,1,6,0,3,0,3,1,0,0,0,0,0,0).
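
The binning rule can be sketched as follows. This is an illustration only: the text does not say which bin receives a value of exactly 10%, so the sketch assumes it falls in bin 0, and the function names and input values are invented.

```python
import math

def identity_bin(percent_identity):
    """Map a percent identity (0-100) to a bin from 0 to 9.

    Assumption: exactly 10% falls in bin 0, since the text assigns
    11-20% to bin 1 and leaves 10% itself unspecified.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    return max(0, math.ceil(percent_identity / 10) - 1)

def signature(percent_identities):
    """Binned sequence comparison signature for one sequence."""
    return tuple(identity_bin(p) for p in percent_identities)

# Invented percent identities for one sequence against a set of five.
print(signature([100, 4, 18, 55, 91]))
```

A sequence compared with itself scores 100% and therefore contributes a 9, which matches the 9 on the diagonal of the two signatures quoted above.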

A comparison matrix was created by measuring
the Euclidean distance between each of the sequence
comparison signatures shown in Figure 1. The Euclidean
distances were measured as described in Manley,
Multivariate Statistical Methods, a Primer, Chapman Hall
1994. Groups among the 15 sequences were defined using a
divisive hierarchical clustering algorithm as described
in Kaufman and Rousseeuw, Finding Groups in Data: An
Introduction to Cluster Analysis, John Wiley and Sons, New
York (1990). Figure 2 shows a graphical representation
of the sequence comparison signatures rearranged such
that three clusters are apparent. The first cluster
included Sequences 1, 10, 11 and 12; the second cluster
included Sequences 8, 7, 4, 3, 2 and 5; and the third
cluster included Sequences 6, 14, 13, 9 and 15.
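
The distance calculation behind the comparison matrix can be sketched with the two signatures quoted in this example. Only the matrix construction is shown; the divisive hierarchical clustering step is not reproduced here, and the function name is hypothetical.

```python
import math

# Signatures for Sequence 1 and Sequence 2 as given in the text.
SIG_1 = (9, 0, 0, 0, 0, 0, 0, 0, 0, 5, 2, 1, 0, 0, 0)
SIG_2 = (0, 9, 1, 6, 0, 3, 0, 3, 1, 0, 0, 0, 0, 0, 0)

def distance_matrix(signatures):
    """Symmetric matrix of pairwise Euclidean distances between signatures."""
    n = len(signatures)
    return [
        [math.dist(signatures[i], signatures[j]) for j in range(n)]
        for i in range(n)
    ]

matrix = distance_matrix([SIG_1, SIG_2])
# The squared differences sum to 248, so the distance is sqrt(248).
print(round(matrix[0][1], 3))
```

The large distance reflects that Sequences 1 and 2 share no high-scoring partners, which is consistent with their placement in different clusters.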

A new sequence (Sequence 16) was compared to 
the clusters of Figure 2 to identify to which cluster it 
belonged. A sequence comparison signature was calculated 
for Sequence 16 compared to the 15 sequences of the set. 
Comparison of the sequence comparison signature for
Sequence 16 to the other 15 sequence comparison
signatures, shown in Figure 3, indicates that Sequence
16 belongs to the second cluster. Therefore, the
polypeptide having Sequence 16 is predicted to share 
structural features of the polypeptides with Sequences 8, 
7, 4, 3, 2 and 5, in particular at the binding site and 
to bind to the same bound conformation of a common ligand 
that binds at that site. 

EXAMPLE II 

Sequence-Based Clustering of NAD(P)-Binding Polypeptides

This example demonstrates a sequence-based 
method for classifying polypeptides into separate 
pharmacofamilies that correlate with in-class binding to 
similar bound conformations of a ligand and cross-class 
binding to different bound conformations of the ligand. 

A database of NAD(P)-utilizing enzymes was
created primarily from sequences available in the Swiss-
Prot database. The Swiss-Prot database was found to
contain 4,613 sequences for polypeptides that utilize
NAD(P) to perform their enzymatic functions, which
represents approximately 4.7% of the sequences in the
Swiss-Prot database. The database of NAD(P)-utilizing
enzymes included a variety of enzymes including NAD(P)-
dependent oxidoreductases, NAD(P) synthetases, ADP
ribosylating toxins, NAD-dependent ligases, poly-ADP
ribose polymerases, and NAD(P)-dependent deacetylases.

A comparison matrix was calculated for
sequences in the database of NAD(P)-utilizing enzymes and
clusters were identified as described in Example I.
Sequence comparison scores were calculated by the BLAST
algorithm in part because it is a relatively fast
algorithm that is appropriate for rapidly characterizing
large sequence datasets. Three pairwise comparison
metrics were evaluated using the common neighbor
clustering approach: cluster analyses were performed that
utilized either sequence identity scores, E-scores or
bit-scores calculated by BLAST. While each strategy
yielded similar results, cluster analysis using BLAST
bit-scores yielded 120 sequence groups, E-scores yielded
135 sequence groups, and cluster analysis utilizing
sequence identity scores yielded 94 sequence groups. The
differences in the number of sequence groups identified
for each strategy arise from division of groups derived
from clustering by sequence identities into subgroups
when clustered on bit-scores or E-scores. Bit-scores and
E-scores appear to cluster sequences into a larger number
of families that display greater sequence homology
compared to sequence clusters derived from pairwise
sequence identities.

Because the 94 sequence clusters identified
using sequence identity scores appeared to correlate with
structure and pharmacofamily, this set of sequences was
utilized for further analysis. Table 1 shows a list of
the 94 identified sequence families (SF), where the
number of sequences in each SF is provided (members), as
is the number of unique structures (the value 0 indicates
absence of a known structure) and catalytic functions
(enzyme classifications as provided in the Expasy
database). Also shown for each SF are the identity of
cofactors bound by the members, identity of the NAD(P)
pharmacophore common to the members of the SF, and a
description or name of an exemplary enzyme in each SF.
The clustering procedure segregated sequence clusters
that belong to the NAD(P) ubiquinone oxidative complex,
sequence clusters that correspond to enzymes catalyzing
oxidation or reduction of a substrate, and sequence
clusters that catalyze non-redox chemistries such as
NAD(P) synthetases.

Approximately half of the sequence clusters
contained only one enzyme function, while others
contained sequences representing as many as 38 different
catalytic mechanisms. Many sequence clusters that
contained multiple enzyme functions were related by
mechanism or substrates or both. For example, sequence
cluster 23, composed of the disulfide dehydrogenases,
contained 25 enzyme mechanisms that all utilize a coupled
NAD-FAD redox reaction to reduce disulfide bonds. In
several cases, a single enzymatic function or a group of
highly related enzyme functions was represented in
multiple sequence groups. In particular, the alcohol
dehydrogenases (E.C. 1.1.1.1) were found in sequence
clusters 1, 2, 20 and 50.

Sequence clustering results correlated strongly
with protein fold classifications. In each case where
structural information was available for multiple members
of a sequence family, each structure was related by a
common NAD(P)-binding protein fold. In general,
structures in each sequence family correlated to a single
SCOP polypeptide family describing the NAD(P)-binding
domain. In two instances, sequence clusters correlated
to multiple polypeptide folds classified by SCOP.
However, in these two instances, structural folds for
polypeptides in the clusters were very similar,
particularly in the regions of the polypeptide that
interact with the NAD(P) cofactor.

A biologically relevant NAD(P) conformer subset
was generated from NAD(P) conformations derived from
structures of NAD(P) complexed to enzymes in the PDB
database using the methods described in U.S. Patent
Application Nos. 09/753,020 and 09/747,174, which are
hereby incorporated by reference. Using the root mean
square deviation (rmsd) of the NAD(P) atomic coordinates
as a distance metric between conformers, the database was
clustered into 16 NAD(P) pharmacofamilies.
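
The rmsd distance metric can be sketched as below. This is a simplified illustration that assumes the two conformers list their atoms in the same order and have already been superimposed; the incorporated applications may handle alignment differently, and the function name and coordinates are hypothetical.

```python
import math

def rmsd(conf_a, conf_b):
    """RMSD between two conformers given as equal-length lists of (x, y, z).

    Assumption: atoms are in matching order and the conformers are
    already superimposed; no alignment is performed here.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformers must be non-empty and equal in length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(conf_a, conf_b))
    return math.sqrt(total / len(conf_a))

# Two invented two-atom conformers displaced by 1 angstrom along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))
```

Pairwise values computed this way can fill a distance matrix that is then clustered in the same manner as the sequence comparison signatures.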

As shown in Table 1, a single pharmacofamily
(NAD(P) pharmacophore) identified from comparison of
bound ligand structures could be correlated with many of
the sequence families (SF in Table 1) that were
identified from sequence-based comparisons.



Table 1. Sequence families (SF) of NAD(P)-utilizing enzymes
[only the Enzyme Description column is legible; the
remaining columns could not be recovered]

Enzyme Description

Ketol-acid reductoisomerase
Glucose-6-phosphate 1-dehydrogenase
Dihydrodipicolinate reductase
NADP-dependent malic enzyme
Glucose-6-phosphate 1-dehydrogenase
N-acetyl-γ-glutamyl phosphate reductase
Short-chain dehydrogenases 1
Short-chain dehydrogenases 2
NAD(H)-dependent FMN reductase
Disulfide dehydrogenases
Sulfite reductase
Cyp450 reductase / Ferredoxin reductase / NO synthase
Aldo-keto reductases
Inosine-5'-monophosphate dehydrogenase
Dihydrofolate reductase
Isocitrate / 3-isopropylmalate dehydrogenase
3-hydroxy-3-methylglutaryl-CoA reductase
Aldehyde dehydrogenases
Gamma-glutamyl phosphate reductase
NAD(P)H dehydrogenase (quinone)
Shikimate 5-dehydrogenase / Dehydroquinate synthase
(multifunctional proteins)
Histidinol dehydrogenase
Glutamyl-tRNA reductases
Light-independent protochlorophyllide reductase
Deoxyhypusine synthase
5-amino-6-(5-phosphoribosylamino)uracil reductase
Malate dehydrogenase
Mannitol-1-phosphate-5-dehydrogenase
Acyl-CoA reductase
Myo-inositol-1-phosphate synthase
D-nopaline dehydrogenase
Nitrate-inducible formate dehydrogenase
Precorrin-6x reductase
Phosphoadenosine phosphosulfate reductase
Saccharopine dehydrogenase
Pyrroline-5-carboxylate reductase
Oxygen-independent coproporphyrinogen III oxidase
L-ornithine 5-monooxygenase
Acyl-[ACP] desaturase
Ornithine cyclodeaminase
Mono-ADP-ribosyltransferase C3 precursor (botulinum)
Pertussis toxin / Cholera enterotoxin precursor
Exotoxin A Diphtheria toxin precursor
NAD(P)+-arginine ADP-ribosyltransferase
RNA 2'-phosphotransferase (ADP-ribosylated)
Poly [ADP-ribose] polymerase
ADP-ribosyl cyclase
Farnesyl-diphosphate farnesyltransferase
Nicotinamide-nucleotide adenylyltransferase
NH(3)-dependent NAD+ synthase
NADH pyrophosphatase
Inorganic polyphosphate/ATP-NAD kinase
Sir2 regulatory protein
DNA ligase
Phosphate glucosidases
Alpha-glucosidase
TRK system K+ uptake protein

EXAMPLE III

A Three-Dimensional Homology Model for the NADPH-Binding
Domain of 1-Deoxy-D-xylulose 5-phosphate Reductoisomerase
Based on a Template Identified by Sequence-Based
Clustering

This example demonstrates use of sequence-based
clustering to identify a template structure for homology
modeling of 1-Deoxy-D-xylulose 5-phosphate
reductoisomerase (DXPR). This example further provides a
homology model for the three dimensional structure of the
amino terminal NADP-binding domain of DXPR. Validation
of the model using nuclear magnetic resonance
spectroscopy is also demonstrated.

1-Deoxy-D-xylulose 5-phosphate reductoisomerase
(DXPR) is an enzyme involved in isoprenoid biosynthesis,
catalyzing the formation of 2-C-methyl-D-erythritol
4-phosphate from 1-deoxy-D-xylulose 5-phosphate
(Takahashi et al., Proc. Natl. Acad. Sci. USA
95:9879-9884 (1998)). The
deoxyxylulose pathway, found in some bacteria, algae,
plants and protozoa, is an alternative to the ubiquitous
mevalonate pathway for isoprenoid biosynthesis
(Eisenreich et al., Trends Plant Sci. 6:78-84 (2001)).
Because a three dimensional model of the DXPR structure
was not available and to aid in the design of inhibitors
of DXPR, a model for the NADPH-binding, N-terminal domain
of the enzyme from E. coli was produced and validated as
set forth below.

The E. coli DXPR amino acid sequence was used
to search for homologs with BLAST and PSI-BLAST using
default parameters. Neither algorithm identified
homologous sequences below an E-score of 0.005 in the
Swiss-Prot database (other than orthologues of DXPR).
Other methods such as SDSC1 (Shindyalov and Bourne,
Fourth Meeting on the Critical Assessment of Techniques
for Protein Structure Prediction, A-92 (2000)) and 3D-
JIGSAW (Bates and Sternberg, Proteins: Structure,
Function and Genetics Suppl. 3:47-54 (1999)) were also
unable to identify homologues for potential use as
templates. The threading server 3D-PSSM (Kelley et al.,
J. Mol. Biol. 299:499-520 (2000)) also did not identify
any hits below a significant E-value.

Sequence comparison signatures were determined
for the NAD(P)-binding sequences (including 28 DXPR
sequences) in the Swiss-Prot database [12] and clustering
was performed as described in Examples I and II. The 28
DXPR sequences formed one cluster. When visualized in a
comparison matrix, the DXPR cluster was proximal to other
clusters. These other clusters were composed of
aspartate semialdehyde dehydrogenase, homoserine
dehydrogenase, N-acetyl-γ-glutamyl phosphate
reductoisomerase, or glyceraldehyde 3-phosphate
dehydrogenase, all of which share a common NAD(P)-binding
Rossmann fold. The proximity correlated with local
sequence identity between DXPR sequences and sequences of
these other clusters, ranging from about 17 to 40% local
sequence identity. Although the E-scores of these
sequence identities were between 0.1 and 2.0, these
clusters were identified as related groups because
multiple DXPR sequences systematically showed cross-talk
to only the above mentioned sequence clusters. In
particular, cross-talk was identified as low sequence
identity (less than 30%) between the cluster containing
DXPR and a few sequences belonging to other clusters,
which showed a pattern that was distinct from a pattern
observed in the cluster. The cross-talk was
distinguishable from true noise because in the case of
noise, only a single DXPR sequence had low similarity to
some other cluster. Based on these data, the NADP-
binding domain of E. coli DXPR was predicted to contain a
Rossmann fold.
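
The distinction between cross-talk and noise can be sketched as a simple count of how many distinct DXPR sequences show a low-identity match to each other cluster. The threshold of two sequences, the function name and the sequence identifiers are assumptions made for illustration; the example itself does not state a numerical cutoff.

```python
from collections import defaultdict

def cross_talk_clusters(hits, min_sequences=2):
    """Return clusters matched by at least min_sequences distinct queries.

    hits is an iterable of (query_sequence_id, other_cluster_label) pairs,
    one per low-identity match. min_sequences=2 is an assumed threshold:
    a cluster hit by a single query is treated as noise.
    """
    queries_per_cluster = defaultdict(set)
    for query_id, cluster in hits:
        queries_per_cluster[cluster].add(query_id)
    return sorted(
        cluster
        for cluster, queries in queries_per_cluster.items()
        if len(queries) >= min_sequences
    )

# Invented matches: three DXPR sequences hit one cluster, one hits another.
hits = [
    ("DXPR_01", "homoserine dehydrogenase"),
    ("DXPR_02", "homoserine dehydrogenase"),
    ("DXPR_07", "homoserine dehydrogenase"),
    ("DXPR_03", "unrelated cluster"),
]
print(cross_talk_clusters(hits))
```

Counting distinct query sequences, instead of total matches, keeps one sequence with several weak matches from being mistaken for a systematic relationship.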

The local sequence identities between the
sequences in the proximal clusters occurred in the N-
terminal, NAD(P)-binding domain. In order to choose a
template for homology modeling of the DXPR NAD(P)-binding
domain, the sequences in the other clusters were
evaluated according to their proximity to DXPR in the
sequence comparison matrix and whether or not a
structural model was available for members of the
cluster. Homoserine dehydrogenase and aspartate
semialdehyde dehydrogenase showed the most proximity to
DXPR in the sequence comparison matrix. Of these two, a
crystal structure was available for homoserine
dehydrogenase.

A multiple alignment of E. coli DXPR with the
NAD-binding domain of S. cerevisiae homoserine
dehydrogenase was performed using ClustalW (Thompson et
al., Nucl. Acids Res. 22:4673-4680 (1994)). The NAD-
binding motif of E. coli DXPR (LGXTGSIG) aligned very
well with the NAD-binding motif of S. cerevisiae
homoserine dehydrogenase (IGAGWGS) as shown in Figure 4.
This alignment was used to build several models of E.
coli DXPR using the MODELER module in MSI Insight II
(Sali and Blundell, J. Mol. Biol. 234:779-815 (1993)).
The model having the least coiling of loops was chosen
and is shown in Figure 5, with some NADP-contact residues
colored in blue (isoleucine), black (methionine), and
cyan (lysine). The bound conformation of NAD from
homoserine dehydrogenase is superimposed on the model and
shown in green.

The validity of the homology model was tested 
using nuclear magnetic resonance (NMR) spectroscopy. 
Based on proton chemical shifts, it was possible to 
observe changes in the chemical environment around NADPH 
and thereby determine which residues in the polypeptide 
were interacting with the coenzyme. Nuclear Overhauser 
Effect peaks (NOEs) observed between NADPH and residues
in the binding pocket of E. coli DXPR were consistent
with those in the homology model in that methionine,
isoleucine and lysine residues were observed to be in
proximity of the cofactor. Thus, the model satisfied the
constraints observed by NMR spectroscopy.

Throughout this application various 
publications have been referenced. The disclosures of 
these publications in their entireties are hereby 
incorporated by reference in this application in order to 
more fully describe the state of the art to which this 
invention pertains. 




Although the invention has been described with 
reference to the examples provided above, it should be 
understood that various modifications can be made without 
departing from the spirit of the invention. Accordingly, 
the invention is limited only by the claims. 



